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Elliptical Insights: Understanding 
Statistical Methods through Elliptical 
Geometry 

Michael Friendly, Georges Monette and John Fox 

Abstract. Visual insights into a wide variety of statistical methods, for both 
didactic and data analytic purposes, can often be achieved through geomet- 
ric diagrams and geometrically based statistical graphs. This paper extols 
and illustrates the virtues of the ellipse and her higher-dimensional cousins 
for both these purposes in a variety of contexts, including linear models, mul- 
tivariate linear models and mixed-effect models. We emphasize the strong 
relationships among statistical methods, matrix-algebraic solutions and ge- 
ometry that can often be easily understood in terms of ellipses. 

Key words and phrases: Added-variable plots, Bayesian estimation, con- 
centration ellipse, data ellipse, discriminant analysis, Francis Galton, hy- 
pothesis-error plots, kissing ellipsoids, measurement error, mixed-effect mod- 
els, multivariate meta-analysis, regression paradoxes, ridge regression, sta- 
tistical geometry. 
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1. INTRODUCTION 

Whatever relates to extent and quantity 
may be represented by geometrical figures. 
Statistical projections which speak to the 
senses without fatiguing the mind, possess 
the advantage of fixing the attention on a 
great number of important facts. 

Alexander von Humboldt [(1811), 

page ciii] 
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In the beginning, there was an ellipse. As mod- 
ern statistical methods progressed from bivariate to 
multivariate, the ellipse escaped the plane to a 3D 
ellipsoid, and then grew onward to higher dimen- 
sions. This paper extols and illustrates the virtues 
of the ellipse and her higher-dimensional cousins for 
both didactic and data analytic purposes. 

When Francis Galton (1886) first studied the re- 
lationship between heritable traits of parents and 
their offspring, he had a remarkable visual insight — 
contours of equal bivariate frequencies in the joint 
distribution seemed to form concentric shapes whose 
outlines were, to Galton, tolerably close to concen- 
tric ellipses differing only in scale. 

Galton's goal was to to predict (or explain) how 
a characteristic, Y, (e.g., height) of children was re- 
lated to that of their parents, X. To this end, he 
calculated summaries, Ave(y|X), and, for symme- 
try, Ave(V|y), and plotted these as lines of means 
on his diagram. Lo and behold, he had a second vi- 
sual insight: the lines of means of (V| V) and (V|y) 
corresponded approximately to the locus of horizon- 
tal and vertical tangents to the concentric ellipses. 
To complete the picture, he added lines showing the 
major and minor axes of the family of ellipses, with 
the result shown in Figure I. 
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DIAGRAM BA3LD ON TABLE I . 
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Fig. 1. Galton's 1886 diagram, showing the relationship of 
height of children to the average of their parents ' height. The 
diagram is essentially an overlay of a geometrical interpreta- 
tion on a hivariate grouped frequency distribution, shown as 
numbers. 

It is not stretching the point too far to say that 
a large part of modern statistical methods descends 
from these visual insights:^ correlation and regres- 
sion [Pearson (1896)], the bivariate normal distri- 
bution, and principal components [Pearson (1901), 
Hotelling (1933)] all trace their ancestry to Galton's 
geometrical diagram.^ 

Basic geometry goes back at least to Euclid, but 
the properties of the ellipse and other conic sections 
may be traced to Apollonius of Perga (ca. 262 BC- 
ca. 190 BC), a Greek geometer and astronomer who 
gave the ellipse, parabola and hyperbola their mod- 
ern names. In a work popularly called the Conies 
[Boyer (1991)], he described the fundamental prop- 
erties of ellipses (eccentricity, axes, principles of tan- 
gency, normals as minimum and maximum straight 



Pearson [(1920), page 37] later stated, "that Galton should 
have evolved all this from his observations is to my mind one 
of the most noteworthy scientific discoveries arising from pure 
analysis of observations." 

^WeU, not entirely. Auguste Bravais [1811-1863] (1846), an 
astronomer and physicist first introduced the mathematical 
theory of the bivariate normal distribution as a model for the 
joint frequency of errors in the geometric position of a point. 
Bravais derived the formula for level slices as concentric el- 
lipses and had a rudimentary notion of correlation but did 
not appreciate this as a representation of data. Nonetheless, 
Pearson (1920) acknowledged Bravais's contribution, and the 
correlation coefficient is often called the Bravais-Pearson co- 
efficient in France [Denis (2001)]. 



lines to the curve) with remarkable clarity nearly 
2000 years before the development of analytic ge- 
ometry by Descartes. 

Over time, the ellipse would be called to duty 
to provide simple explanations of phenomena once 
thought complex. IMost notable is Kepler's insight 
that the Copernican theory of the orbits of plan- 
ets as concentric circles (which required notions of 
epicycles to account for observations) could be brought 
into alignment with the detailed observational data 
from Tycho Brahe and others by an exquisitely sim- 
ple law: "The orbit of every planet is an ellipse with 
the sun at a focus." One century later, Isaac New- 
ton was able to connect this elliptical geometry with 
astrophysics by deriving all three of Kepler's laws as 
simpler consequences of general laws of motion and 
universal gravitation. 

This paper takes up the cause of the ellipse as a 
geometric form that can provide similar service to 
statistical understanding and data analysis. Indeed, 
it has been doing that since the time of Galton, but 
these graphic and geometric contributions have of- 
ten been incidental and scattered in the literature 
[e.g., Bryant (1984), Campbell and Atchley (1981), 
Saville and Wood (1991), Wickens (1995)]. We fo- 
cus here on visual insights through ellipses in the 
areas of linear models, multivariate linear models 
and mixed-effect models. Our goal is to provide as 
comprehensive a treatment of this topic as possible 
in a single article together with online supplements. 

The plan of this paper is as follows: Section 2 
provides the minimal notation and properties of el- 
lipsoids^ necessary for the remainder of the paper. 
Due to length restrictions, other useful and impor- 
tant properties of geometric and statistical ellipsoids 
have been relegated to the Appendix. Section 3 de- 
scribes the use of the data ellipsoid visual sum- 
mary for multivariate data. In Section 4 we apply 
data ellipsoids and confidence ellipsoids for parame- 
ters in linear models to explain a wide range of phe- 
nomena, paradoxes and fallacies that are clarified 
by this geometric approach. This view is extended 
to multivariate linear models in Section 5, primar- 
ily through the use of ellipsoids to portray hypoth- 
esis (H) and error (E) covariation in what we call 
HE plots. Finally, in Section 6 we discuss a diverse 



■^As in this paragraph, we generally use the term "ellipsoid" 
as to refer to "ellipse or ellipsoid" where dimensionality does 
not matter or context is clear. 
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Table 1 

Statistical and geometrical measures of "size" of an ellipsoid 



Size 


Conceptual formula 


Geometry 


Function 


(a) Generalized variance: 


det(S)=n,Ai 


area, (hyper)volume 


geometric mean 


(b) Average variance: 


tr(S) = E,A. 


linear sum 


arithmetic mean 


(c) Average precision: 


l/tr(S-i) = l/E,(lAO 




harmonic mean 


(d) Maximal variance: 


Ai 


maximum dimension 


supremum 



collection of current statistical problems whose so- 
lutions can all be described and visualized in terms 
of "kissing ellipsoids." 

2. NOTATION AND BASIC RESULTS 

There are various representations of an ellipse (or 
ellipsoid in three or more dimensions), both geomet- 
ric and statistical. Some basic notation and proper- 
ties are described below. 

2.1 Geometrical Ellipsoids 

We refer to the common notion of a bounded ellip- 
soid (with nonempty interior) in the p-dimensional 
space MP as a proper ellipsoid. An origin-centered 
proper ellipsoid may be defined by the quadratic 
form 

(1) £::={x:x'^Cx< 1}, 

where equality in equation (1) gives the boundary, 
X = {xi,X2, . . . , Xp)"^ is a vector referring to the co- 
ordinate axes and C is a symmetric positive defi- 
nite px p matrix. If C is only positive semi-definite, 
then the ellipsoid will be improper, having the shape 
of a cylinder with elliptical cross-sections and un- 
bounded in the direction of the null space of C. To 
extend the definition to singular (sometimes known 
as "degenerate") ellipsoids, we turn to a definition 
that is equivalent to equation (1) for proper ellip- 
soids. Let S denote the unit sphere in MP, 

(2) 5:={x:x"^x=l}, 

and let 

(3) £:=AS, 

where A is a nonsingular p x p matrix. Then £ is 
a proper ellipsoid that could be defined using equa- 
tion (1) with C = (AA""")"^. We obtain singular el- 
lipsoids by allowing A to be any matrix, not nec- 
essarily nonsingular or even square. A more gen- 
eral representation of ellipsoids based on the sin- 
gular value decomposition (SVD) of C is given in 
Appendix A.l. Some useful properties of geometric 
ellipsoids are described in Appendix A. 2. 



2.2 Statistical Ellipsoids 

In statistical applications, C will often be the in- 
verse of a covariance matrix (or a sum of squares 
and cross-products matrix) and the ellipsoid will be 
centered at the means of variables or at estimates of 
parameters under some model. Hence, we will also 
use the following notation: 

For a positive definite matrix 5] we use iS(/2, X)) 
to denote the ellipsoid 

(4) £::={x:(x-aa)'^I]-1(x -/!) = !}. 

When S is the covariance matrix of a multivariate 
vector x with eigenvalues Ai > A2 > • • • , the follow- 
ing properties represent the "size" of the ellipsoid in 
MP (see Table 1). 

For testing hypotheses for parameters of multi- 
variate linear models, these different senses of "size" 
correspond (with suitable transformations) to (a) 
Wilks's A, (b) the Hotelling-Lawley trace, (c) the 
Pillai trace, and (d) Roy's maximum root tests, as 
we describe below in Section 5. 

Note that every nonnegative definite matrix W 
can be factored as W = AA"*", and the matrix A 
can always be selected so that it is square. A will be 
nonsingular if and only if W is nonsingular. A com- 
putational definition of an ellipsoid that can be used 
for all nonnegative definite matrices and that cor- 
responds to the previous definition in the case of 
positive-definite matrices is 

(5) £{fx,W) = fi + AS, 

where <S is a unit sphere of conformable dimension 
and fi is the centroid of the ellipsoid. One convenient 
choice of A is the Choleski square root, W^/^, as we 
describe in Appendix A. 3. Thus, for some results 
below, a convenient notation in terms of W is 

(6) £{^l,w) = ^l®^/w = ^l®w^/^, 

where ® emphasizes that the ellipsoid is a scaling 
and rotation of the unit sphere followed by transla- 
tion to a center at fi and y/W = W-*^/^ = A. This 
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Fig. 2. Sunflower plot of Gallon's data on heights of parents and their children (in.), with 40%, 68% and 95% data ellipses 
and the regression lines of y on x (black) and x on y (grey). The ratio of the vertical to the regression line (labeled "r") to 
the vertical to the top of the ellipse gives a visual estimate of the correlation (r = 0.46, here). Shadows (projections) on the 
coordinate axes give standard intervals, x ± ks^ and y±ksy, with k = 1,1.5,2.45, having bivariate coverage 40%, 68% and 
95% and univariate coverage 68%, 87% and 98. 6%, respectively. Plotting children's height on the abscissa follows Galton. 



representation is not unique, however: /x © B = i/ © 
C (i.e., they generate the same ehipsoid) iff fj, = u 
and BB""" = CC'^ . Prom this result, it is readily seen 
that under a linear transformation given by a matrix 
L the image of the ellipse is 

(7) = L/x © VLWLT 

= L/i©L\/W. 

3. THE DATA ELLIPSE AND ELLIPSOID 

The data ellipse [Monette (1990)] [or concentra- 
tion ellipse, Dempster (1969), Chapter 7] provides 
a remarkably simple and effective display for view- 
ing and understanding bivariate marginal relation- 
ships in multivariate data. The data ellipse is typi- 
cally used to add a visual summary to a scatterplot, 
indicating the means, standard deviations, correla- 
tion and slope of the regression line for two vari- 
ables. Under classical (Gaussian) assumptions, the 
data ellipse provides a statistically sufficient visual 
summary, as we describe below. 

It is historically appropriate to illustrate the data 
ellipse and describe its properties using Galton's 
[(1886), Table I] data, from which he drew Figure 1 
as a conceptual diagram,^ shown in Figure 2, where 



''These data are reproduced in Stigler [(1986), Table 8.2, 
page 286]. 



the frequency at each point is represented by a sun- 
flower symbol. We also overlay the 40%, 68% and 
95% data ellipses, as described below. 

In Figure 2, the ellipses have the mean vector 
(x,y) as their center; the lengths of arms of the cen- 
tral cross show the standard deviations of the vari- 
ables, which correspond to the shadows of the 40% 
ellipse. In addition, the correlation coefficient can 
be visually represented as the fraction of a vertical 
tangent line from y to the top of the ellipse that is 
below the regression line y\x, shown by the arrow 
labeled "r." Finally, as Galton noted, the regres- 
sion line for y\x (or x\y) can be visually estimated 
as the locus of the points of vertical (or horizon- 
tal) tangents with the family of concentric ellipses. 
See Monette [(1990), Figures 5.1-5.2] and Friendly 
[(1991), page 183] for illustrations and further dis- 
cussion of the properties of the data ellipse. 

More formally [Dempster (1969), Monette (1990)], 
for a p-dimensional sample, Y„xpi we recognize the 
quadratic form in equation (4) as corresponding to 
the squared Mahalanobis distance, -D|j(y) = (y — 
y)''"S"^(y-y), of the point y = (2/1,1/2, • • ■,ypV from 
the centroid of the sample, y = (yi, 2/2, • • • 1 Vp)'^ ■ Thus, 
we use a more explicit notation to define the data el- 
lipsoid £c of size ( "radius" ) c as the set of all points 
y with D\i{y) less than or equal to c^, 

(8) f,(y, S) := {y : (y - y)'S~\y - y) < c^}, 

where S = (n - 1)"^ YH=i{yi - y)(yi - y"'") is the 
sample covariance matrix. In the computational no- 
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(a) (b) 

Fig. 3. Scatterplot matrices of Anderson's iris data: (a) showing data, separate 68% data ellipses, and regression lines for 
each species; {h) showing only ellipses and regression lines. Key — Iris setosa; blue, A; Iris versicolor; red, +; Iris virginca; 
green, □. 



tation of equation (6), the boundary of the data el- 
hpsoid of radius c is thus 

(9) fe(y,S)=yecSi/2_ 

Many properties of the data elhpsoid hold regard- 
less of the joint distribution of the variables; but if 
the variables are multivariate normal, then the data 
ellipsoid approximates a contour of constant density 
in their joint distribution. In this case D\.^{x,y) has 
a large-sample distribution or, in finite samples, 
approximately [p{n - 1)/ (n - p)]Fp,„_p). 

Hence, in the bivariate case, taking = X2(0-95) = 
5.99 ~ 6 encloses approximately 95% of the data 
points under normal theory. Other radii also have 
useful interpretations: 

• In Figure 2 we demonstrate that (? = X2(0-40) w 1 
gives a data ellipse of 40% coverage with the prop- 
erty that its projection on either axis corresponds 
to a standard interval, x ^Isx and y it Isy. The 
same property of univariate coverage pertains to 
any linear combination of x and y. 

• By analogy with a univariate sample, a 68% cov- 
erage data ellipse with = xKO.GS) = 2.28 gives 
a bivariate analog of the standard x ± \sx and y it 
Isy intervals. The univariate shadows, or those of 



any linear combination, then correspond to stan- 
dard Scheffe intervals taking "fishing" (simultane- 
ous interfence) in a p = 2-dimensional space into 
account. 

As useful as the data ellipse might be for a single, 
unstructured sample, its value as a visual summary 
increases with the complexity of the data. For exam- 
ple. Figure 3 shows scatterplot matrices of all pair- 
wise plots of the variables from Edgar Anderson's 

(1935) classic data on three species of iris flowers 
found in the Gaspe Peninsula, later used by Fisher 

(1936) in his development of discriminant analysis. 
The data ellipses show clearly that the means, vari- 
ances, correlations and regression slopes differ sys- 
tematically across the three iris species in all pair- 
wise plots. We emphasize that the ellipses serve as 
sufficient visual summaries of the important statis- 
tical properties (first and second moments)^ by re- 



^We recognize that a normal-theory summary (first and 
second moments), shown visually or numerically, can be dis- 
torted by multivariate outliers, particularly in smaller sam- 
ples. In what follows, robust covariance estimates can, in prin- 
ciple, be substituted for the classical, normal-theory estimates 
in all cases. To save space, we do not explore these possibilities 
further here. 
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moving the data points from the plots in the version 
at the right. 

4. LINEAR MODELS: DATA ELLIPSES AND 
CONFIDENCE ELLIPSES 

Here we consider how ehipses help to visualize re- 
lationships among variables in connection with lin- 
ear models (regression, ANOVA). We begin with 
views in the space of the variables (data space) and 
progress to related views in the space of model pa- 
rameters (/3 space). 

4.1 Simple Linear Regression 

Various aspects of the standard data ellipse of ra- 
dius 1 illuminate many properties of simple linear 
regression, as shown in Figure 4. These properties 
are also useful in more complex contexts: 

• One-half of the widths of the vertical and hor- 
izontal projections (dotted black lines) give the 
standard deviations Sx and Sy, respectively. 

• Because the perpendicular projection onto any 
line through the center of the ellipse, {x,y), corre- 
sponds to some linear combination, mx + ny, the 
half-width of the corresponding projection of the 
ellipse gives the standard deviation of this linear 
combination. 

• With a multivariate normal distribution the line 
segment through the center of the ellipse shows 




X 



Fig. 4. Annotated standard data ellipse showing standard 
deviations of x and y, residual standard deviation (se), slope 
(b) and correlation (r). 



the mean and standard deviation of the condi- 
tional distribution on that line. 

• The standard deviation of the residuals, Se-, can be 
visualized as the half-width of the vertical (red) 
line at x = x. 

• The vertical distance between the mean of y and 
the points where the ellipse has vertical tangents 
is rsy. (As a fraction of Sy, this distance is r = 0.75 
in the figure.) 

• The (blue) regression line of y on x passes through 
the points of vertical tangency. Similarly, the re- 
gression of X on y (not shown) passes through the 
points of horizontal tangency. 

4.2 Visualizing a Confidence Interval for the 
Slope 

A visual approximation to a 95% confidence inter- 
val for the slope, and thus a visual test of Hq : /3 = 0, 
can be seen in Figure 5. From the formula for a 95% 
confidence interval, CIo.95(/3) = 6±t°-^2^ x SE(6), we 
can take t^'^i' ^ 2 and SE(6) ~ -^(f^), leading to 

(10) CIo.95(/?)«fe±4=>^ f-V 

To show this visually, the left panel of Figure 5 
displays the standard data ellipse surrounded by the 
"regression parallelogram," formed with the vertical 
tangent lines and the tangent lines parallel to the 
regression line. This corresponds to the conjugate 
axes of the ellipse induced by the Choleski factor 
of Syx as shown in Figure A. 3 in Appendix A. 3. 
Simple algebra demonstrates that the diagonal lines 
through this parallelogram have slopes of 




So, to obtain a visual estimate of the 95% confi- 
dence interval for /3 {not, we note, the 95% CI for 
the regression line), we need only shrink the diago- 
nal lines of the regression parallelogram toward the 
regression line by a factor of giving the red 

lines in the right panel of Figure 5. In the data used 
for this example, n = 102, so the factor is approxi- 
mately 0.2 here.^ Now consider the horizontal line 
through the center of the data ellipse. If this line is 
outside the envelope of the confidence lines, as it is 
in Figure 5, we can reject Hq : /? = via this simple 
visual approximation. 



The data are for the rated prestige and average years of 
education of 102 Canadian occupations circa 1970; see [Fox 
and Suschnigg (1989)]. 
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6 8 10 12 14 16 6 8 10 12 14 16 

Education Education 

Fig. 5. Visual 95% confidence interval for the slope in linear regression. Left: Standard data ellipse surrounded by the 
regression parallelogram. Right: Shrinking the diagonal lines by a factor of 2/y'n, gives the approximate 95% confidence 
interval for j3. 



4.3 Simpson's Paradox, Marginal and 
Conditional Relationships 

Because it provides a visual representation of 
means, variances and correlations, the data ellipse 
is ideally suited as a tool for illustrating and expli- 
cating various phenomena that occur in the anal- 
ysis of linear models. One class of simple, but im- 
portant, examples concerns the difference between 
the marginal relationship between variables, ignor- 



ing some important factor or covariate, and the con- 
ditional relationship, adjusting (controlling) for that 
factor or covariate. 

Simpson's paradox [Simpson (1951)] occurs when 
the marginal and conditional relationships differ in 
direction. This may be seen in the plots of Sepal 
length against Sepal width for the iris data shown 
in Figure 6. Ignoring iris species, the marginal, total- 
sample correlation is slightly negative as seen in 




30 35 
Sepal width (mm; 



(a) Total sample, marginal ellipse, 
ignoring species 



(b) Individual sample, 
conditional ellipses — species 



(c) Pooled, within-sample ellipse 



Fig. 6. Marginal (a), conditional (b) and pooled withm- sample (c) relationships of Sepal length and Sepal width in the iris 
data. Total-sample data ellipses are shown as black, solid curves; individual- group data and ellipses are shown with colors and 
dashed lines. 
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panel (a). The individual-sample ellipses in panel 
(b) show that the conditional, within-species cor- 
relations are all positive, with approximately equal 
regression slopes. The group means have a negative 
relationship, accounting for the negative marginal 
correlation. 

A correct analysis of the (conditional) relation- 
ship between these variables, controlling or adjust- 
ing for mean differences among species, is based on 
the pooled within-sample covariance matrix, 

g rii 

^within = {N - gY^'^'^iYij - yi.){yij -yi-Y 
1=1 1=1 

i=l 

where N = '^ni, and the result is shown in panel (c) 
of Figure 6. In this graph, the data for each species 
were first transformed to deviations from the species 
means on both variables and then translated back 
to the grand means. 

In a more general context, Swithin appears as the 
E matrix in a multivariate linear model, adjusting or 
controlling for all fitted effects (factors and covari- 
ates). For essentially correlational analyses (princi- 
pal components, factor analysis, etc.), similar dis- 
plays can be used to show how multi-sample analy- 
ses can be compromised by substantial group mean 
differences and corrected by analysis of the pooled 
within-sample covariance matrix, or by including 
important group variables in the model. Moreover, 
display of the individual within-group data ellipses 
can show visually how well the assumption of equal 
covariance matrices. Si = S2 = • • • = Sg, is satisfied 
in the data, for the two variables displayed. 

4.4 Other Paradoxes and Fallacies 

Data ellipses can also be used to visualize and un- 
derstand other paradoxes and fallacies that occur 
with linear models. We consider situations in which 
there is a principal relationship between variables y 
and X of interest, but (as in the preceding subsec- 
tion) the data are stratified in g samples by a factor 
("group") that might correspond to different sub- 
populations (e.g., men and women, age groups), dif- 
ferent spatial regions (e.g., states), different points 
in time or some combination of the above. 

In some cases, group may be unknown or may not 
have been included in the model, so we can only es- 
timate the marginal association between y and x, 



giving a slope /^marginal and correlation ^"marginal- 

In 

other cases, we may not have individual data, but 
only aggregate group data, {yi,Xi),i = 1, . . . ,g, from 
which we can estimate the between-groups ("eco- 
logical") association, with slope /^between and cor- 
relation rbetween- When all data are available and 
the model is an ANCOVA model of the form y ~ 
X + group, we can estimate a common conditional, 
within-group slope, /^within, or, with the model y~ 
X + group + X X group, the separate within-group 
slopes, /3j. 

Figure 7 illustrates these estimates in a simula- 
tion of five groups, with rii = 10, means Xi = 2i + 
W(-0. 4,0.4) and yi = Xi + Af{0, 0.5^), so that 
J^bctwccn ~ 0.95. Here U{a,b) represents the uniform 
distribution between a and 6, and A/'(/i,o"^) repre- 
sents the normal distribution with mean /x and vari- 
ance (7^ . For simplicity, we have set the within-group 
covariance matrices to be identical in all groups, 
with Var(x) = 6, Var(?/) = 2 and Cov{x,y) = ±3 in 
the left and right panels, respectively, giving rwithin = 
±0.87. 

In the left panel, the conditional, within-group 
slope is smaller than the ecological, between-group 
slope, reflecting the smaller within-group than between- 
group correlation. In general, however, it can be 
shown that 

/^marginal ^ [/^within' /^between] ; 

which is also evident in the right panel, where the 
within-group slope is negative. This result follows 
from the fact that the marginal data ellipse for the 
total sample has a shape that is a convex combina- 
tion (weighted average) of the average within-group 
covariance of {x,y), shown by the green ellipse in 
Figure 7, and the covariance of the means {xi,yi), 
shown by the red between-group ellipse. In fact, 
the between and within data ellipses in Figure 7 
are just (a scaling of) the H and E ellipses in an 
hypothesis-error (HE) plot for the MANOVA model, 
(x, y) ~ group, as will be developed in Section 5. See 
Figure 8 for a visual demonstration, using the same 
data as in Figure 7. 

The right panels of Figures 7 and 8 provide a pro- 
totypical illustration of Simpson's paradox, where 
/^within and /^marginal Can have opposite signs. Un- 
derlying this is a more general marginal fallacy (re- 
quiring only substantively different estimates, but 
not necessarily different signs) that can occur when 
some important factor or covariate is unmeasured 
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Fig. 7. Paradoxes and fallacies: between (ecological), within (conditional) and whole-sample (marginal) associations. In both 
panels, the five groups have the same group means, and Var(a:) = 6 and Var(2/) — 2 within each group. The within-group 
correlation is r = +0.87 in all groups m the left panel and is r = —0.87 in the right panel. The green ellipse shows the average 
within-group data ellipse. 



or has been ignored. The fahacy consists of esti- 
mating the unconditional or marginal relationship 
(/^marginal) and believing that it reflects the con- 
ditional relationship, or that those pesky "other" 
variables will somehow average out. In practice, the 
marginal fallacy probably occurs most often when 



one views a scatterplot matrix of {y,xi,X2, ■ ■ ■) and 
believes that the slopes of relationships in the sepa- 
rate panels reflect the pairwise conditional relation- 
ships with other variables controlled. In a regres- 
sion context, the antidote to the marginal fallacy is 
the added- variable plot (described in Section 4.8), 




Fig. 8. Visual demonstration that (3 ^^^^^^^^ lies between (3^^^^^^^^ ''^'^'^ /^between ■ Ecich panel shows an HE plot for the MANOVA 
model {x,y) ~ group, in which the within and between ellipses are identical to those in Figure 7, except for scale. 
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which displays the conditional relationship between 
the response and a predictor directly, controlling for 
all other predictors. 

The right panels of Figures 7 and 8 also illustrate 
Robinson's paradox [Robinson (1950)], where /3withm 
and /Sbetween Can have opposite signs. The more gen- 
eral ecological fallacy [e.g., Lichtman (1974), Kramer 
(1983)] is to draw conclusions from aggregated data, 
estimating /^between or rbetween, believing that they 
reflect relationships at the individual level, estimat- 
ing /Swithin or rwithin- Perhaps the earliest instance of 
this was Andre-Michel Guerry's (1833) use of the- 
matic maps of France depicting rates of literacy, 
crime, suicide and other "moral statistics" by de- 
partment to argue about the relationships of these 
moral variables as if they reflected individual be- 
havior.^ As can be seen in Figure 7, the ecological 
fallacy can often be resolved by accounting for some 
confounding variable(s) that vary between groups. 

Finally, there are situations where only a subset 
of the relevant data are available (e.g., one group 
in Figure 7) or when the relevant data are available 
only at the individual level, so that only the con- 
ditional relationship, /3within) can be estimated. The 
atomistic fallacy (also called the fallacy of compo- 
sition or the individualistic fallacy), for example, 
Alker (1969), Riley (1963), is the inverse to the eco- 
logical fallacy and consists of believing that one can 
draw conclusions about the ecological relationship, 
/^between) from the Conditional one. 

The atomistic fallacy occurs most often in the con- 
text of multilevel models [Diez-Roux (1998)] where 
it is desired to draw inferences regarding variabil- 
ity of higher-level units (states, countries) from data 
collected from lower-level units. For example, imag- 
ine that the right panel of Figure 7 depicts the nega- 
tive relationship of mortality from heart disease (y) 
with individual income (x) for individuals within 



''William Robinson (1950) examined the relationship be- 
tween literacy rate and percentage of foreign-born immigrants 
in the U.S. states from the 1930 Census. He showed that there 
was a surprising positive correlation, rbotwoon = 0.526 at the 
state level, suggesting that foreign birth was associated with 
greater literacy; at the individual level, the correlation rwithin 
was —0.118, suggesting the opposite. An explanation for the 
paradox was that immigrants tended to settle in regions of 
greater than average literacy. 

^Guerry was certainly aware of the logical problem of eco- 
logical inference, at least in general terms [Friendly (2007a)], 
and carried out several side analyses to examine potential 
confounding variables. 



countries. It would be fallacious to infer that the 
same slope (or even its sign) applies to a between- 
country analysis of heart disease mortality vs. GNP 
per capita. A positive value of /3bctween in this con- 
text might result from the fact that, across coun- 
tries, higher GNP per capita is associated with less 
healthy diet (more fast food, red meat, larger por- 
tions), leading to increased heart disease. 

4.5 Leverage, Influence and Precision 

The topic of leverage and influence in regression 
is often introduced with graphs similar to Figure 9, 
what we call the "leverage-influence quartet." In 
these graphs, a bivariate sample of n = 20 points 
was first generated with x ~ A/'(40, 10^) and y ~ 
10 + 0.75x + J\f{0,2.5^). Then, in each of panels (b)- 
(d) a single point was added at the locations shown, 
to represent, respectively, a low-leverage point with 
a large residual,^ a high-leverage point with small 
residual (a "good" leverage point) and a high-leverage 
point with large residual (a "bad" leverage point). 
The goal is to visualize how leverage [oc {x — x)^] 
and residual {y — y*) (where y* is the fitted value 
for observation z, computed on the basis of an aux- 
iliary regression in which observation i is deleted) 
combine to produce influential points — those that 
affect the estimates oi (3 = (/3o, /Si)"''. 

The "standard" version of this graph shows only 
the fitted regression lines for each panel. So, for 
the moment, ignore the data ellipses in the plots. 
The canonical, first-moment-only, story behind the 
standard version is that the points added in pan- 
els (b) and (c) are not harmful — the fitted line does 
not change very much when these additional points 
are included. Only the bad leverage point, "OL," in 
panel (d) is harmful. 

Adding the data ellipses to each panel immedi- 
ately makes it clear that there is a second-moment 
part to the story — the effect of unusual points on 
the precision of our estimates of /3. Now, we see 
directly that there is a big difference in impact be- 
tween the low-leverage outlier [panel (b)] and the 
high-leverage, small-residual case [panel (c)], even 
though their effect on coefficient estimates is neg- 
ligible. In panel (b), the single outlier infiates the 
estimate of residual variance (the size of the vertical 
slice of the data ellipse at x). 



In this context, a residual is "large" when the point in 
question deviates substantially from the regression line for 
the rest of the data — what is sometimes termed a "deleted 
residual;" see below. 
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(c) High leverage, good fit (d) High leverage, Outlier 



Fig. 9. Leverage-Influence quartet with data ellipses, (a) Original data; (b) adding one low-leverage outlier (O); (c) adding 
one "good" leverage point (L); (d) adding one "bad" leverage point (OL). In panels (b)-(d) the dashed black line is the fitted 
line for the original data, while the thick solid blue line reflects the regression including the additional point. The data ellipses 
show the effect of the additional point on precision. 
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Fig. 10. Data ellipses in the Leverage-Influence quartet. 
This graph overlays the data ellipses and additional points 
from the four panels of Figure 9. It can be seen that only the 
OL point affects the slope, while the O and L points affect 
precision of the estimates in opposite directions. 

To make the added value of the data ehipse more 
apparent, we overlay the data ellipses from Figure 9 
in a single graph, shown in Figure 10, to allow direct 
comparison. Because you now know that regression 
lines can be visually estimated as the locus of verti- 
cal tangents, we suppress these lines in the plot to 
focus on precision. Here, we can also see why the 
high-leverage point "L" [added in panel (c) of Fig- 
ure 9] is called a "good leverage point." By increas- 
ing the standard deviation of x, it makes the data 



ellipse somewhat more elongated, giving increased 
precision of our estimates of (3. 

Whether a "good" leverage point is really good de- 
pends upon our faith in the regression model (and in 
the point), and may be regarded either as increasing 
the precision of /3 or providing an illusion of preci- 
sion. In either case, the data ellipse for the modified 
data shows the effect on precision directly. 

4.6 Ellipsoids in Data Space and /3 Space 

It is most common to look at data and fitted 
models in "data space," where axes correspond to 
variables, points represent observations, and fitted 
models are plotted as lines (or planes) in this space. 
As we've suggested, data ellipsoids provide informa- 
tive summaries of relationships in data space. For 
linear models, particularly regression models with 
quantitative predictors, there is another space — "/3 
space" — that provides deeper views of models and 
the relationships among them. In /3 space, the axes 
pertain to coefficients and points are models (true, 
hypothesized, fitted) whose coordinates represent val- 
ues of parameters. 

In the sense described below, data space and (3 
space are dual to each other. In simple linear re- 
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gression, for example, each line in data space corre- 
sponds to a point in /3 space, the set of points on 
any line in (3 space corresponds to a pencil of lines 
through a given point in data space, and the propo- 
sition that every pair of points defines a line in one 
space corresponds to the proposition that every two 
lines intersect in a point in the other space. 

Moreover, ellipsoids in these spaces are dual and 
inversely related to each other. In data space, joint 
confidence intervals for the mean vector or joint pre- 
diction regions for the data are given by the ellip- 
soids (xi ,X2Y ® c\/S. In the dual /3 space, joint con- 
fidence regions for the coefficients of a response vari- 
able y on {xi,X2) are given by ellipsoids of the form 
(3 © c\/S~^. We illustrate these relationships in the 
example below. 

Figure 11 shows a scatterplot matrix among the 
variables Heart (y), an index of cardiac damage. 
Coffee (xi), a measure of daily coffee consumption, 
and Stress (X2), a measure of occupational stress, 
in a contrived sample of n = 20. For the sake of the 



example we assume that the main goal is to deter- 
mine whether or not coffee is good or bad for your 
heart, and stress represents one potential confound- 
ing variable among others (age, smoking, etc.) that 
might be useful to control statistically. 

The plot in Figure 11 shows only the marginal re- 
lationship between each pair of variables. The mar- 
ginal message seems to be that coffee is bad for your 
heart, stress is bad for your heart and coffee con- 
sumption is also related to occupational stress. Yet, 
when we fit both variables together, we obtain the 
following results, suggesting that coffee is good for 
you (the coefficient for coffee is now negative, though 
nonsignificant). How can this be? (See Table 2). 

Figure 12 shows the relationship between the pre- 
dictors in data space and how this translates into 
joint and individual confidence intervals for the coef- 
ficients in j3 space. The left panel is the same as the 
corresponding (Coffee, Stress) panel in Figure 11, 
but with a standard (40%) data ellipse. The right 
panel shows the joint 95% confidence region and the 
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Table 2 

Coefficients and tests for the joint model predicting heart 
disease from coffee and stress 





Estimate (/3) 


Std. error 


t value 


Pr(> 


Intercept 


-7.7943 


5.7927 


-1.35 


0.1961 


Coffee 


-0.4091 


0.2918 


-1.40 


0.1789 


Stress 


1.1993 


0.2244 


5.34 


0.0001 



individual 95% confidence intervals in /3 space, de- 
termined as 

where d is the number of dimensions for which we 
want coverage, v is the residual degrees of freedom 
for Se, and Sx is the covariance matrix of the pre- 
dictors. 

Thus, the green ellipse in Figure 12 is the ellipse 

of joint 95% coverage, using the factor ^J2F^^ and 

covering the true values of (/3stressi /^Coffee) in 95% of 
samples. Moreover: 

• Any joint hypothesis (e.g., Hq : /^stress = 1, /^Coffee = 
1) can be tested visually, simply by observing 
whether the hypothesized point, (1, 1) here, lies 
inside or outside the joint confidence ellipse. 

• The shadows of this ellipse on the horizontal and 
vertical axes give the Scheffe joint 95% confidence 
intervals for the parameters, with protection for 



simultaneous inference ("fishing") in a 2-dimen- 
sional space. 

• Similarly, using the factor \Jf^i^J^ = t\, "^^'^ 
would give an ellipse whose ID shadows are 1 — a 
Bonferroni confidence intervals for d posterior hy- 
potheses. 

Visual hypothesis tests and d= \ confidence in- 
tervals for the parameters separately are obtained 
from the red ellipse in Figure 12, which is scaled 

by ^JFf^ = We call this the "confidence- 

interval generating ellipse" (or, more compactly, the 
"confidence- interval ellipse"). The shadows of the 
confidence-interval ellipse on the axes (thick red lines) 
give the corresponding individual 95% confidence 
intervals, which are equivalent to the (partial, 
Type III) t-tests for each coefficient given in the 
standard multiple regression output shown above. 
Thus, controlling for Stress, the confidence interval 
for the slope for Coffee includes 0, so we cannot re- 
ject the hypothesis that /3coffee = in the multiple 
regression model, as we saw above in the numeri- 
cal output. On the other hand, the interval for the 
slope for Stress excludes the origin, so we reject the 
null hypothesis that /3stress = 0, controlling for Cof- 
fee consumption. 

Finally, consider the relationship between the data 
ellipse and the confidence ellipse. These have exactly 
the same shape, but the confidence ellipse is exactly 




Fig. 12. Data space and (3 space representations of Coffee and Stress. Left: Standard (40%) data ellipse. Right: Joint 95% 
confidence ellipse (green) for (T^Coffco, /Sstrcssy', CI ellipse (red) with 95% univariate shadows. 
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Fig. 13. Joint 95% confidence ellipse for (ficoBee, PstTCBs), 
together with the ID marginal confidence interval for /3coffco 
ignoring Stress (thick blue line), and a visual confidence in- 
terval for /3strcss — /Scoffco = (dark cyan). 

a 90" rotation and rescaling of the data ehipse. In 
directions in data space where the shce of the data 
elhpse is wide — where we have more information 
about the relationship between Coffee and Stress — 
the projection of the confidence ellipse is narrow, 
reflecting greater precision of the estimates of coef- 
flcients. Conversely, where slice of the the data el- 
lipse is narrow (less information), the projection of 
the confidence ellipse is wide (less precision). See 
Figure A. 2 for the underlying geometry. 

The virtues of the confidence ellipse for visualiz- 
ing hypothesis tests and interval estimates do not 
end here. Say we wanted to test the hypothesis that 
Coffee was unrelated to Heart damage in the sim- 
ple regression ignoring Stress. The (Heart, Coffee) 
panel in Figure 11 showed the strong marginal rela- 
tionship between the variables. This can be seen in 
Figure 13 as the oblique projection of the confidence 
ellipse to the horizontal axis where /^stress = 0. The 
estimated slope for Coffee in the simple regression 
is exactly the oblique shadow of the center of the 
ellipse (/Scoflfee; /^Stress) through the point where the 
ellipse has a horizontal tangent onto the horizontal 
axis at /3strcss = 0. The thick blue line in this fig- 
ure shows the confidence interval for the slope for 
Coffee in the simple regression model. The confi- 
dence interval does not cover the origin, so we reject 



Ho : /3coffce = in the simple regression model. The 
oblique shadow of the red 95% confidence-interval 
ellipse onto the horizontal axis is slightly smaller. 
How much smaller is a function of the i-value of the 
coefficient for Stress? 

We can go further. As we noted earlier, all linear 
combinations of variables or parameters in data or 
models correspond graphically to projections (shad- 
ows) onto certain subspaces. Let's assume that Cof- 
fee and Stress were measured on the same scales so 
it makes sense to ask if they have equal impacts on 
Heart disease in the joint model that includes them 
both. Figure 13 also shows an auxiliary axis through 
the origin with slope = — 1 corresponding to values 
of /^Stress — PcoScc- The orthogonal projection of the 
coefficient vector on this axis is the point estimate 
of /3stress — /^Coffee and the shadow of the red ellipse 
along this axis is the 95% confidence interval for the 
difference in slopes. This interval excludes 0, so we 
would reject the hypothesis that Coffee and Stress 
have equal coefficients. 

4.7 Measurement Error 

In classical linear models, the predictors are often 
considered to be fixed variables or, if random, to 
be measured without error and independent of the 
regression errors; either condition, along with the as- 
sumption of linearity, guarantees unbiasedness of the 
standard OLS estimators. In practice, of course, pre- 
dictor variables are often also observed indicators, 
subject to error, a fact that is recognized in errors- 
in-variables regression models and in more general 
structural equation models but often ignored other- 
wise. Ellipsoids in data space and /3 space are well 
suited to showing the effect of measurement error in 
predictors on OLS estimates. 

The statistical facts are well known, though per- 
haps counter-intuitive in certain details: measure- 
ment error in a predictor biases regression coeffi- 
cients, while error in the measurement in y increases 
the standard errors of the regression coefficients but 
does not introduce bias. 

In the top row of Figure 11, adding measurement 
error to the Heart disease variable would expand 
the data ellipses vertically, but (apart from random 
variation) leaves the slopes of the regression lines 
unchanged. Measurement error in a predictor vari- 
able, however, biases the corresponding estimated 
coefficient toward zero (sometimes called regression 
attenuation) as well as increasing standard errors. 
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Fig. 14. Effects of measurement error in Stress on the marginal relationship between Heart disease and Stress. Each panel 
starts with the observed data (S — 0), then adds random normal error, A/'(0, 5 x SDstross), with 5 = {0.75, 1.0, 1.5}, to the value 
of Stress. Increasing measurement error biases the slope for Stress toward 0. Left: 50% data ellipses; right: 50% confidence 
ellipses for {/3o , /^stress) • 



Figure 14 demonstrates this effect for the marginal 
relation between Heart disease and Stress, with data 
ellipses in data space and the corresponding confi- 
dence ellipses in /3 space. Each panel starts with 
the observed data (the darkest ellipse, marked 0), 
then adds random normal error, M{0,5 x SDstress)) 
with 5 = {0.75, 1.0, 1.5}, to the value of Stress, while 
keeping the mean of Stress the same. All of the data 
ellipses have the same vertical shadows (SDncart), 
while the horizontal shadows increase with 6, driv- 
ing the slope for Stress toward 0. In f3 space, it can 
be seen that the estimated coefficients, (/3o; /^stress), 
vary along a line and approach /3strcss = for (5 suf- 
ficiently large. The vertical shadows of ellipses for 
(/3o, /3stress) along the /^stress axis also demonstrate 
the effects of measurement error on the standard 
error of /3stress- 

Perhaps less well-known, but both more surpris- 
ing and interesting, is the effect that measurement 
error in one variable, xi, has on the estimate of the 
coefficient for an other variable, X2, in a multiple 
regression model. Figure 15 shows the confidence el- 
lipses for (/Jcoffee, /^Stress) in the multiple regression 
predicting Heart disease, adding random normal er- 
ror Mi0,6 X SDstress), with 6 = {0,0.2,0.4,0.8}, to 
the value of Stress alone. As can be plainly seen, 
while this measurement error in Stress attenuates 
its coefficient, it also has the effect of biasing the 
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Fig. 15. Biasing effect of measurement error in one vari- 
able (Stress) on the coefficient of another variable (Coffee) in 
a multiple regression. The coefficient for Coffee is driven to- 
ward its value in the marginal model using Coffee alone, as 
measurement error in Stress makes it less informative in the 
joint model. 

coefficient for Coffee toward that in the marginal 
regression of Heart disease on Coffee alone. 
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Fig. 16. Added variable plots for Stress and Coffee in the multiple regression predicting Heart disease. Each panel also shows 
the 50% conditional data ellipse for residuals (Xfe,y*), shaded red. 



4.8 Ellipsoids in Added-Variable Plots 

In contrast to the marginal, bivariate views of the 
relationships of a response to several predictors (e.g., 
such as shown in the top row of the scatterplot ma- 
trix in Figure 11), added-variable plots (aka partial 
regression plots) show the partial relationship be- 
tween the response and each predictor, where the 
effects of all other predictors have been controlled 
or adjusted for. Again we find that such plots have 
remarkable geometric properties, particularly when 
supplemented by ellipsoids. 

Formally, we express the fitted standard linear 
model in vector form as y = y|X = /SqI + /3ixi + 

/32X2 H h /3pXp, with model matrix X = [1, xi , . . . , 

Xp]. Let X[_^] be the model matrix omitting the col- 
umn for variable k. Then, algebraically, the added 
variable plot for variable k is the scatterplot of the 
residuals (x^,y*) from two auxiliary regressions,"'^'^ 
fitting y and Xfc from Xj„^.], 

y* = y|others = y - y|X[_fc], 
Xfc = Xfc I others = Xfc - Xfc I X[_fc] . 

'^^These quantities can all be computed [Velleman and 
Welsh (1981)] Irom the results ol a single regression for the 
full model. 



Geometrically, in the space of the observations, 
the fitted vector y is the orthogonal projection of y 
onto the subspace spanned by X. Then y* and x^ 
are the projections onto the orthogonal complement 
of the subspace spanned by X[„fc], so the simple re- 
gression of y* on x^, has slope /3fc in the full model, 
and the residuals from the line y* = /3fcX^ in this 
plot are identically the residuals from the overall re- 
gression of y on X. 

Another way to describe the added-variable plot 
(AVP) for Xk is as a 2D projection of the space of 
(y,X), viewed in a plane projecting the data along 
the intersection of two hyperplanes: the plane of the 
regression of y on all of X, and the plane of regres- 
sion of y on X[_fc]. A third plane, that of the re- 
gression of Xfc on X[_^,], also intersects in this space 
and defines the horizontal axis in the AVP. This is 
illustrated in Figure 17, showing one view defined 
by the intersection of the three planes in the right 
panel. 

Figure 16 shows added-variable plots for Stress 
and Coffee in the multiple regression predicting Heart 



The "space of the observations" is yet a third, n- 
dimensional, space, in which the observations are the axes 
and each variable is represented as a point (or vector). See, 
for example, Fox [(2008), Chapter 10]. 

Animated 3D movies of this plot are included among the 
supplementary materials for this paper. 
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Fig. 17. 3D views of the relationship between Heart, Coffee and Stress, showing the three regression planes for the marginal 
models, Heart ~ Coffee (green), Heart ~ Stress (pink), and the joint model. Heart ~ Coffee + Stress (light blue). Left: a 
standard view; right: a view showing all three regression planes on edge. The ellipses in the side panels are 2D projections of 
the standard conditional (red) and marginal (blue) ellipsoids, as shown in Figure 18. 



disease, supplemented by data ellipses for the residu- 
als (x^„, y*). With reference to the properties of data 
ellipses in marginal scatterplots (see Figure 4), the 
following visual properties (among others) are useful 
in this discussion. These results follow simply from 
translating "marginal" into "conditional" (or "par- 
tial") in the present context. The essential idea is 
that the data ellipse of the AVP for (x^,y*) is to 
the estimate of a coefficient in a multiple regression 
as the data ellipse of (x,2/) is to simple regression. 
Thus: 

(1) The simple regression least squares fit of y* 
on has slope j3k-, the partial slope for Xk in the 
full model (and intercept = 0). 

(2) The residuals, (y* — y*), shown in this plot 
are the residuals for y in the full model. 

(3 The correlation between x^ and y*, seen in 
the shape of the data ellipse for these variables, is 
the partial correlation between y and with the 
other predictors in X[_fc] partialled out. 

(4) The horizontal half-width of the AVP data 
ellipse is proportional to the conditional standard 
deviation of x^ remaining after all other predictors 
have been accounted for, providing a visual inter- 



pretation of variance inflation due to collinear pre- 
dictors, as we describe below. 

(5) The vertical half-width of the data ellipse is 
proportional to the residual standard deviation Se 
in the multiple regression. 

(6) The squared horizontal positions, (x^)^, in the 
plot give the partial contributions to leverage on the 
coefficient (3^ of x^. 

(7) Items (3) and (7) imply that the AVP for x^ 
shows the partial influence of individual observa- 
tions on the coefficient /3fc , in the same way as in Fig- 
ure 9 for marginal models. These inffuence statistics 
are often shown numerically as DFBETA statistics 
[Belsley, Kuh and Welsch (1980)]. 

(8) The last three items imply that the collection 
of added- variable plots for y and X provide an easy 
way to visualize the leverage and influence that indi- 
vidual observations — and indeed the joint influence 
of subsets of observations — have on the estimation 
of each coefficient in a given model. 

Elliptical insight also permits us to go further, 
to depict the relationship between conditional and 
marginal views directly. Figure 18 shows the same 
added- variable plots for Heart disease on Stress and 
Coffee as in Figure 16 (with a zoomed-out scaling). 
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Fig. 18. Added-variable + marginal plots for Stress and Coffee in the multiple regression predicting Heart disease. Each 
panel shows the 50% conditional data ellipse for x1.,y* residuals (shaded, red) as well as the marginal 50% data ellipse for the 
(xk,y) variables, shifted to the origin. Arrows connect the mean-centered marginal points (open circles) to the residual points 
(filled circles). 



but here we also overlay the marginal data ellipses 
for {xk,y), and marginal regression lines for Stress 
and Coffee separately. In 3D data space, these are 
the shadows (projections) of the data ellipsoid onto 
the planes defined by the partial variables. In 2D 
AVP space, they are just the marginal data ellipses 
translated to the origin. 

The most obvious feature of Figure 18 is that the 
AVP for Coffee has a negative slope in the condi- 
tional plot (suggesting that controlling for Stress, 
coffee consumption is good for your heart), while in 
the marginal plot increasing coffee seems to be bad 
for your heart. This serves as a regression example 
of Simpson's paradox, which we considered earlier. 

Less obvious is the fact that the marginal and 
AVP ellipses are easily visualized as a shadow versus 
a slice of the full data ellipsoid. Thus, the AVP el- 
lipse must be contained in the marginal ellipse, as we 
can see in Figure 18. If there are only two x's, then 
the AVP ellipse must touch the marginal ellipse at 
two points. The shrinkage of the intersection of the 
AVP ellipse with the y axis represents improvement 
in fit due to other x's. 

More importantly, the shrinkage of the width (pro- 
jected onto a horizontal axis) represents the square 
root of the variance inflation factor (VIF) , which can 
be shown to be the ratio of the horizontal width of 



the marginal ellipse of {xk,y), with standard devia- 
tion s{xk) to the width of the conditional ellipse of 
{x^,y*), with standard deviation s(xfe|others). This 
geometry implies interesting constraints among the 
three quantities: improvement in fit, VIF, and change 
from the marginal to conditional slope. 

Finally, Figure 18 also shows how conditioning on 
other predictors works for individual observations, 
where each point of (x^,y*) is the image of (xfc,y) 
along the path of the marginal regression. This re- 
minds us that the AVP is a 2D projection of the 
full space, where the regression plane of y on ^[~k] 
becomes the vertical axis and the regression plane 
of Xfc on becomes the horizontal axis. 

5. MULTIVARIATE LINEAR MODELS: 
HE PLOTS 

Multivariate linear models (MvLMs) have a spe- 
cial affinity with ellipsoids and elliptical geometry, 
as described in this section. To set the stage and es- 
tablish notation, we consider the MvLM [e.g., Timm 
(1975)] given by the equation Y = XB + U, where 
Y is an n X p matrix of responses in which each 
column represents a distinct response variable; X is 
the n X q model matrix of full column rank for the 
regressors; B is the q x p matrix of regression co- 
efficients or model parameters; and U is the n x p 
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Table 3 

Multivariate test statistics as functions of the eigenvalues Ai solving det(H — AE) = Q or 
eigenvalues pi solving det[H — p(H + E)] — 



Criterion 


Formula 


"mean" of p 


Partial r]^ 


Wilks's A 
Pillai trace 

Hotelling-Lawley trace 
Roy maximum root 


A=n:T^^=re(i-po 


geometric 
arithmetic 
harmonic 
supremum 


v' = ^ 



matrix of errors, with vec(U) ~ A/^(0, I^®!]), where 
® is the Kronecker product. 

A convenient feature of the MvLM for general 
multivariate responses is that all tests of linear hy- 
potheses (for null effects) can be represented in the 
form of a general linear test, 

(12) Ho: L B = , 

{hxq){qxp) (hxp) 

where L is a rank h <q matrix of constants whose 
rows specify h linear combinations or contrasts of 
the parameters to be tested simultaneously by a 
multivariate test. 

For any such hypothesis of the form given in equa- 
tion (12), the analogs of the univariate sums of 
squares for hypothesis (SSh) and error (SS^) are 
the p X p sum of squares and cross-products (SSP) 
matrices given by 

(13) H = SSP J/ = (LB)'^[L(X"^X)"L'^]~\lB) 

and 

(14) E = SSP£; = Y"^Y-B"^(X"^X)B = U"^U, 

where U = Y — XB is the matrix of residuals. Multi- 
variate test statistics (Wilks's A, Pillai trace, Hotel- 
ling-Lawley trace, Roy's maximum root) for test- 
ing equation (12) are based on the s = min(p, h) 
nonzero latent roots Ai > A2 > • • • > of the matrix 
H relative to the matrix E, that is, the values of A 
for which det(H — AE) = or, equivalently, the la- 
tent roots Pi for which det[H - /9(H + E)] = 0. The 
details are shown in Table 3. These measures at- 
tempt to capture how "large" H is, relative to E 
in s dimensions, and correspond to various "means" 
as we described earlier. All of these statistics have 
transformations to F statistics giving either exact 
or approximate null-hypothesis F distributions. The 
corresponding latent vectors provide a set of s or- 
thogonal linear combinations of the responses that 
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Fig. 19. Geometry of the classical test statistics used in tests 
of hypotheses in multivariate linear models. The figure shows 
the representation of the ellipsoid generated by (H + E) rel- 
ative to E m canonical space where E* = I and (H + E)* is 
the corresponding transformation o/(H + E). 

produce maximal univariate F statistics for the hy- 
pothesis in equation (12); we refer to these as the 
canonical discriminant dimensions. 

Beyond the informal characterization of the four 
classical tests of hypotheses for multivariate linear 
models given in Table 3, there is an interesting ge- 
ometrical representation that helps one to appre- 
ciate their relative power for various alternatives. 
This can be illustrated most simply in terms of the 
canonical representation, (H-l-E)*, of the ellipsoid 
generated by (H -|- E) relative to E, as shown in 
Figure 19 for p = 2. 

With Aj as described above, the eigenvalues and 
squared radii of (H-l-E)* are Aj -|- 1, so the lengths of 
the major and minor axes are a = \/Xi + 1 and b = 
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Fig. 20. (a^ Data ellipses and (b) corresponding HE plot for sepal length and petal length m the ins data set. The H 
ellipse is the data ellipse of the fitted values defined by the group means, yi. The E ellipse is the data ellipse of the residuals, 
iyij — Yi ). Using evidence ("significance") scaling of the H ellipse, the plot has the property that the multivariate test for a 
given hypothesis is significant by Roy's largest root test iff the H ellipse protrudes anywhere outside the E ellipse. 



-v/A2 + 1, respectively. The diagonal of the triangle 
comprising the segments o, b (labeled c) has length 
c = ^/ a? + 6^ . Finally, a line segment from the origin 
dropped perpendicularly to the diagonal joining the 
two ellipsoid axes is labeled d. 

In these terms, Wilks's test, based on (1 + A^)^^, 
is equivalent to a test based on a x 6 which is propor- 
tional to the area of the framing rectangle, shown 
shaded in Figure 19. The Hotelling-Lawley trace 
test, based on X^Aj, is equivalent to a test based 
on c = sjYli Aj +P- Finally, the Pillai Trace test, 
based on X] Ai(l + Aj)~^, can be shown to be equal 
to 2 — d"^ for p = 2. Thus, it is strictly monotone in 
d and equivalent to a test based directly on d. 

The geometry makes it easy to see that if there is a 
large discrepancy between Ai and A2, Roy's test de- 
pends only on Ai while the Pillai test depends more 
on A2. Wilks's A and the Hotelling-Lawley trace cri- 
terion are also functional averages of Ai and A2, with 
the former being penalized when A2 is small. In prac- 
tice, when s < 2, all four test criteria are equivalent, 
in that their standard transformations to F statis- 
tics are exact and give rise to identical values. 

5.1 Hypothesis-Error (HE) Plots 

The essential idea behind HE plots is that any 
multivariate hypothesis test, equation (12), can be 



represented visually by ellipses (or ellipsoids beyond 
2D) that express the size of covariation against a 
multivariate null hypothesis (H) relative to error co- 
variation (E). The multivariate tests, based on the 
latent roots of HE~^, are thus translated directly 
to the sizes of the H ellipses for various hypotheses, 
relative to the size of the E ellipse. Moreover, the 
shape and orientation of these ellipses show some- 
thing more — the directions (linear combinations of 
the responses) that lead to various effect sizes and 
significance. 

Figure 20 illustrates this idea for two variables 
from the iris data set. Panel (a) shows the data el- 
lipses for sepal length and petal length, equivalent to 
the corresponding plot in Figure 3. Panel (b) shows 
the HE plot for these variables from the one-way 
MAN OVA model yij = + Ujj testing equal mean 
vectors across species, Hq : /x^^ = = Ma- Let Y be 
the nxp matrix of fitted values for this model, that 
is, Y = {yj.}. Then H = Y"''Y — nyy""" (where y is 
the grand-mean vector), and the H ellipse in the 
figure is then just the 2D projection of the data el- 
lipsoid of the fitted values, scaled as described be- 
low. Similarly, U = Y - Y, and E = U"^U = (A^ - 
(7) Spooled) SO the E ellipse is the 2D projection of 
the data ellipsoid of the residuals. Visually, the E 
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ellipsoid corresponds to shifting the separate within- 
group data ellipsoids to the centroid, as illustrated 
above in Figure 6(c). 

In HE plots, the E matrix is first scaled to a co- 
variance matrix E/d/e, dividing by the error degrees 
of freedom, dfe- The ellipsoid drawn is translated to 
the centroid y of the variables, giving y ©cE-^/^/d/g. 
This scaling and translation also allows the means 
for levels of the factors to be displayed in the same 
space, facilitating interpretation. In what follows, we 
show these as "standard" bivariate ellipses of 68% 

coverage, using c = ^2F^-^^^ , except where noted 

otherwise. 

The ellipse for H reflects the size and orientation 
of covariation against the null hypothesis. In relation 
to the E ellipse, the H ellipse can be scaled to show 
either the effect size or strength of evidence against 
Hq (significance). 

For effect-size scaling, each H is divided by dfe 
to conform to E. The resulting ellipse is then ex- 
actly the data ellipse of the fitted values, and corre- 
sponds visually to a multivariate analog of univari- 
ate effect-size measures [e.g., {yi — y2)/se where Se 
is the within-group standard deviation]. 

For significance scaling, it turns out to be most 
visually convenient to use Roy's largest root statis- 
tic as the test criterion. In this case, the H ellipse is 
scaled to H/(Aa(i/e), where is the critical value 
of Roy's statistic. Using this scaling gives a simple 
visual test of Hq: Roy's test rejects Hq at a given 
Q level iff the corresponding a-level H ellipse pro- 
trudes anywhere outside the E ellipse. Moreover, 
the directions in which the hypothesis ellipse exceed 
the error ellipse are informative about the responses 
and their linear combinations that depart signifi- 
cantly from Hq. Thus, in Figure 20(b), the variation 
of the means of the iris species shown for these two 
variables appears to be largely one-dimensional, cor- 
responding to a weighted sum (or average) of petal 



The F test based on Roy's largest root uses the ap- 
proximation F — {df2/dfi)Xi with degrees of freedom d/i,d/2, 
where d/i — max(d/(i, d/e) and d/2 = dfe — dfi + dfh- Invert- 
ing the F statistic gives the critical value for an a-level test: 
K^{dh/df2)F^f-'^,^^. 

^''Other multivariate tests (Wilks's A, Hotelling-Lawley 
trace, Pillai trace) also have geometric interpretations in HE 
plots [e.g., Wilks's A is the ratio of areas (volumes) of the H 
and E ellipses (ellipsoids); Hotelling-Lawley trace is based on 
the sum of the but these statistics do not provide such 
simple visual comparisons. All HE plots shown in this paper 
use significance scaling, based on Roy's test. 



length and sepal length, perhaps a measure of over- 
all size. 

5.2 Linear Hypotheses: Geometries of Contrasts 
and Sums of Effects 

Just as in univariate AN OVA designs, important 
overall effects (df/i > 1) in MANOVA may be use- 
fully explored and interpreted by the use of con- 
trasts among the levels of the factors involved. In 
the general linear hypothesis test of equation (12), 
contrasts are easily specified as one or more {hi x q) 
L matrices, Li,L2, . . . , each of whose rows sums to 
zero. 

As an important special case, for an overall effect 
with dfh degrees of freedom (and balanced sample 
sizes), a set of df/j pairwise orthogonal (1 x g) L 
matrices (LJLj = for i^ j) gives rise to a set of dfh 
rank-one Hj matrices that additively decompose the 
overall hypothesis SSCP matrix (by a multivariate 
analog of Pythagoras' Theorem), 

H = Hi -I- H2 H 1- Hdf ^ , 

exactly as the univariate SSh may be decomposed 
in an ANOVA. Each of these rank-one Hj matrices 
will plot as a vector in an HE plot, and their collec- 
tion provides a visual summary of the overall test, 
as partitioned by these orthogonal contrasts. Even 
more generally, where the subhypothesis matrices 
may be of rank > 1, the subhypotheses will have 
hypothesis ellipses of dimension rank(Hj) that are 
conjugate with respect to the hypothesis ellipse for 
the joint hypothesis, provided that the estimators 
for the subhypotheses are statistically independent. 

To illustrate, we show in Figure 21 an HE plot 
for the sepal width and sepal length variables in the 
iris data, corresponding to panel (1:2) in Figure 3. 
Overlayed on this plot are the one-df H matrices ob- 
tained from testing two orthogonal contrasts among 
the iris species: setosa vs. the average of versicolor 
and virginica (labeled "S:VV"), and versicolor vs. 
virginica ("V:V"), for which the contrast matrices 
are 

Li = (-2 1 1), 

L2 = (0 1 -1), 

where the species (columns) are taken in alphabet- 
ical order. In this view, the joint hypothesis testing 
equality of the species means has its major axis in 
data space largely in the direction of sepal length. 
The ID degenerate "ellipse" for Hi, representing 
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Fig. 21. H and E matrices for sepal width and sepal length 
in the iris data, together with H matrices for testing two or- 
thogonal contrasts in the species effect. 

the contrast of setosa with the average of the other 
two species, is closely aligned with this axis. The 
"ellipse" for H2 has a relatively larger component 
aligned with sepal width. 

5.3 Canonical Projections: Ellipses in Data 
Space and Canonical Space 

HE plots show the covariation leading toward re- 
jection of a hypothesis relative to error covariation 
for two variables in data space. To visualize these re- 
lationships for more than two response variables, we 
can use the obvious generalization of a scatterplot 
matrix showing the 2D projections of the H and E 
ellipsoids for all pairs of variables. Alternatively, a 
transformation to canonical space permits visualiza- 
tion of all response variables in the reduced-rank 2D 
(or 3D) space in which H covariation is maximal. 

In the MAN OVA context, the analysis is called 
canonical discriminant analysis (CDA), where the 
emphasis is on dimension reduction rather than hy- 
pothesis testing. For a one-way design with g groups 
and p-variate observations i in group j, yij, CDA 
finds a set of s = mm(p,g — 1) linear combinations, 
zi = cjy, Z2 = cjy, ...,Zs= cjy, so that: (a) ah Zk 
are mutually uncorrelated; (b) the vector of weights 
ci maximizes the univariate F statistic for the lin- 
ear combination zi; (c) each successive vector of 
weights, Cfc, k = 2, . . . , s, maximizes the univariate 



F-statistic for z^., subject to being uncorrelated with 
all other linear combinations. 

The canonical projection of Y to canonical scores 
Z is given by 

(15) Y„xp^Z„xs = YE-iV/d/e, 

where V is the matrix whose columns are the eigen- 
vectors of HE~^ associated with the ordered nonzero 
eigenvalues, Xi,i = 1, . . . , s. A MANOVA of all s lin- 
ear combinations is statistically equivalent to that 
of the raw data. The Aj are proportional to the 
fractions of between-group variation expressed by 
these linear combinations. Hence, to the extent that 
the first one or two eigenvalues are relatively large, 
a two-dimensional display will capture the bulk of 
between-group differences. The 2D canonical dis- 
criminant HE plot is then simply an HE plot of the 
scores zi and Z2 on the first two canonical dimen- 
sions. (If s > 3, an analogous 3D version may also 
be obtained.) 

Because the z scores are all mutually uncorre- 
lated, the H and E matrices will always have their 
axes aligned with the canonical dimensions. When, 
as here, the z scores are standardized, the E ellipse 
will be circular, assuming that the axes in the plot 
are equated so that a unit data length has the same 
physical length on both axes. 

Moreover, we can show the contributions of the 
original variables to discrimination as follows: Let 
P be the p x s matrix of the correlations of each 
column of Y with each column of Z, often called 
canonical structure coefficients. Then, for variable 
j, a vector from the origin to the point whose coor- 
dinates p.j are given in row j of P has projections 
on the canonical axes equal to these structure coef- 
ficients and squared length equal to the sum squares 
of these correlations. 

Figure 22 shows the canonical HE plot for the 
iris data, the view in canonical space corresponding 
to Figure 21 in data space for two of the variables 
(omitting the contrast vectors) . Note that for g = 3 
groups, dfh = 2, so s = 2 and the representation in 
2D is exact. This provides a very simple interpreta- 
tion: Nearly all (99.1%) of the variation in species 
means can be accounted for by the first canonical 
dimension, which is seen to be aligned with three of 
the four variables, most strongly with petal length. 
The second canonical dimension is mostly related to 
variation in the means on sepal width, and this vari- 
able is negatively correlated with the other three. 
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Fig. 22. Canonical HE plot for the Iris data. In this plot, 
the H ellipse is shown using effect-size scaling to preserve 
resolution, and the variable vectors have been multiplied by a 
constant to approximately fill the plot space. The projections 
of the variable vectors on the coordinate axes show the corre- 
lations of the variables with the canonical dimensions. 

Finally, imagine a 4D version of the HE plot of 
Figure 21 in data space, showing the four-dimension- 
al ellipsoids for H and E. Add to this plot unit 
vectors corresponding to the coordinate axes, scaled 
to some convenient constant length. Some rotation 
would show that the H ellipsoid is really only two- 
dimensional, while E is 4D. Applying the transfor- 
mation given by E"^ as in Figure A. 4 and projecting 
into the 2D subspace of the nonzero dimensions of 
H would give a view equivalent to the canonical HE 
plot in Figure 22. The variable vectors in this plot 
are just the shadows of the original coordinate axes. 

6. KISSING ELLIPSOIDS 

In this section we consider some circumstances in 
which there is a data stratification factor or there 
are two (or more) principles or procedures for deriv- 
ing estimates of a parameter vector /3 of a linear 
model, each with its associated estimated covari- 
ance matrix, for example, with covariance matrix 
Var(^'^) and f3^ with covariance matrix Var(;9^). 
The simplest motivating example is two-group dis- 
criminant analysis (Section 6.2). In data space, so- 
lutions to this statistical problem can be described 
geometrically in terms of the property that the data 
ellipsoids around the group centroids will just "kiss" 



(or osculate) along a path between the two cen- 
troids. We call this path the locus of osculation, 
whose properties are described in Section 6.1. 

Perhaps more interesting and more productive is 
that the same geometric ideas apply equally well in 
parameter (/3) space. Consider, for example, method 
A to be OLS estimation and several alternatives for 
method B, such as ridge regression (Section 6.3) or 
Bayesian estimation (Section 6.4). The remarkable 
fact is that the geometry of such kissing ellipsoids 
provides a clear visual interpretation of these cases 
and others, whenever we consider a convex combi- 
nation of information from two sources. In all cases, 
the locus of osculation is interpretable in terms of 
the statistical goal to be achieved, taking precision 
into account. 

6.1 Locus of Osculation 

The problems mentioned above all have a simi- 
lar and simple physical interpretation: Imagine two 
stones dropped into a pond at locations with co- 
ordinates mi and m2. The waves emanating from 
the centers form concentric circles which osculate 
along the line from mi to m2 . Now imagine a world 
with ellipse-generating stones, where instead of cir- 
cles, the waves form concentric ellipses determined 
by the shape matrices Ai and A2. The locus of os- 
culation of these ellipses will be the set of points 
where the tangents to the two ellipses are parallel 
(or, equivalently, that their normals are parallel). An 
example is shown in Figure 23, using mi = (—2,2), 
m2 = (2, 6), and 

, , . /l.O 0.5\ . / 1.5 -0.3\ 

where we have found points of osculation by trial 
and error. 

An exact general solution can be described as fol- 
lows: Let the ellipses for i = l,2 be given by 

/i(x) = (x - mi)"'"Ai(x - mi), 

{x : /i (x) = c^} = mi e Va" 

and denote their gradient-vector functions as 

(17) V/(....)=(|^,|I) 
so that 

V/i(x) = 2Ai(x-mi), 
V/2(x) = 2A2(x-m2). 
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Fig. 23. Locus of osculation for two families of ellipsoidal 
level curves, with centers at mi = (—2, 2) and m2 = (2, 6), and 
shape matrices Ai and A2 given m equation (16). The left 
ellipsoids (red) have radii = 1,2,3. The right ellipsoids have 
radii = 1,1.74,3.1, where the last two values were chosen to 
make them kiss at the points marked with squares. The black 
curve is an approximation to the path of osculation, using a 
spline function connecting mi to m2 via the marked points of 
osculation. 
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Then, the points where V/i and V/2 are parallel 
can be expressed in terms of the condition that their 
vector cross product u © v = uif 2 — U2V1 = v"'"Cu = 
0, where C is the skew-symmetric matrix 

satisfying C = — C^. Thus, the locus of osculation is 
the set O, given by O = {x G R2|V/i(x) ® V/2(x) = 
0}, which implies 

(18) (x- m2)"^AjCAi(x- mi) = 0. 

Equation (18) is a bilinear form in x, with central 
matrix AJCAi, implying that O is a conic section 
in the general case. Note that when x = mi or x = 
m2, equation (18) is necessarily zero, so the locus of 
osculation always passes through mi and m2. 

A visual demonstration of the theory above is 
shown in Figure 24 (left), which overlays the ellipses 
in Figure 23 with contour lines (hyperbolae, here) of 
the vector cross-product function contained in equa- 
tion (18). When the contours of /i and /2 have the 
same shape (Ai = CA2), as in the right panel of 
Figure 24, equation (18) reduces to a line, in accord 
with the stones-in-pond interpretation. The above 
can be readily extended to ellipsoids in a higher di- 
mension, where the development is more easily un- 
derstood in terms of normals to the surfaces. 
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Fig. 24. Locus of osculation for two families of ellipsoidal level curves, showing contour lines of the vector cross-product 
function equation (18). The thick black curve shows the complete locus of osculation for these two families of ellipses, where 
the cross-product function equals 0. Left: with parameters as in Figure 23 and equation (16). Right: with the same shape matrix 
Ai for both ellipsoids. 
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6.2 Discriminant Analysis 

The right panel of Figure 24, considered in data 
space, provides a visual interpretation of the clas- 
sical, normal theory two-group discriminant analy- 
sis problem under the assumption of equal popula- 
tion covariance matrices. Si = Here, we imag- 
ine that the plot shows the contours of data ellip- 
soids for two groups, with mean vectors mi and 
m2, and common covariance matrix A = Spooled = 
[(?ii - l)Si + (n2 - l)S2]/(ni + n2 - 2). 

The discriminant axis is the locus of osculation be- 
tween the two families of ellipsoids. The goal in dis- 
criminant analysis, however, is to determine a classi- 
fication rule based on a linear function, P(x) = b"''x, 
such that an observation x will be classified as be- 
longing to Group 1 if 2?(x) < d, and to Group 2 
otherwise. In linear discriminant analysis, the dis- 
criminant function coefficients are given by 



Spooled ("11 



m2j 



All boundaries of the classification regions deter- 
mined by d will then be the tangent lines (planes) to 
the ellipsoids at points of osculation. The location 
of the classification region along the line from mi to 
m2 typically takes into account both the prior prob- 
abilities of membership in Groups 1 and 2, and the 
costs of misclassification. Similarly, the left panel of 
Figure 24 is a visual representation of the same prob- 
lem when Si 7^ S2, giving rise to quadratic classifi- 
cation boundaries. 

6.3 Ridge Regression 

In the univariate linear model, y = X/3 -|- e, high 
multiple correlations among the predictors in X lead 
to problems of coUinearity — unstable OLS estimates 
of the parameters in (3 with inflated standard errors 
and coefficients that tend to be too large in abso- 
lute value. Although collinearity is essentially a data 
problem [Fox (2008)], one popular (if questionable) 
approach is ridge regression, which shrinks the es- 
timates toward (introducing bias) in an effort to 
reduce sampling variance. 

Suppose the predictors and response have been 
centered at their means and the unit vector is omit- 
ted from X. Further, rescale the columns of X to 
unit length, so that X"'"X is a correlation matrix. 
Then, the OLS estimates are given by 



Ridge regression replaces the standard residual sum 
of squares criterion with a penalized form, 

RSS(A:) = (y - X/3)T(y - X/3) + A;/3"^/3, 

(20) 



(fc>0), 



whose solution is easily seen to be 



(21) 



3^^ = (XTX + H)^1XV 



where G = [I fc(X'^X)-i]-i. Thus, as the "ridge 
constant" k increases, G decreases, driving /3^^ to- 
ward [Hoerl and Kennard (1970a, 1970b)]. The 
addition of a positive constant k to the diagonal of 
X'''X drives det(X"'"X -|- kl) away from zero even if 
det(X"^X) ^ 0. 

The penalized Lagrangian formulation in equa- 
tion (20) has an equivalent form constrained 
minimization problem, 

^RR ^ argmin(y - X/3)"^(y - X/3) 

13 

(22) 

subject to f3^f3<t{k), 

which makes the size constraint on the parameters 
explicit, with t{k) an inverse function of k. This form 
provides a visual interpretation of ridge regression, 
as shown in Figure 25. Depicted in the figure are 



P2 





(19) 



3OLS ^ (xTx)-lxTy. 



Fig. 25. Elliptical contours of the OLS residual sum of 
squares for two parameters in a regression, together with cir- 
cular contours for the constraint function, /3i + /Jf < Ridge 
regression finds the point /3^^ where the OLS contours just 
kiss the constraint region. 
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the elliptical contours of the OLS regression sum of 
squares, RSS(O) around P^^^. Each ellipsoid marks 
the point closest to the origin, that is, with min /S. 
It is easily seen that the ridge regression solution is 
the point where the elliptical contours just kiss the 
constraint contour. 

Another insightful interpretation of ridge regres- 
sion [Marquardt (1970)] sees the ridge estimator as 
equivalent to an OLS estimator, when the actual 
data in X are supplemented by some number of fic- 
titious observations, n{k), with uncorrelated predic- 
tors, giving rise to an orthogonal matrix, and 
where y = for all supplementary observations. The 
linear model then becomes 

which gives rise to the solution 

(24) 3'''' = [X"^X + {Xlfxlr'jCy. 

But because X^ is orthogonal, (X2)"'"X2 is a scalar 
multiple of I, so there exists some value of k mak- 
ing equation (24) equivalent to equation (21). As 
promised, the ridge regression estimator then re- 
flects a weighted average of the data [X,y] with 
n{k) observations [X^,0] biased toward (3 = 0. In 
Figure 25, it is easy to imagine that there is a direct 
translation between the size of the constraint region, 
t{k), and an equivalent supplementary sample size, 
n{k), in this interpretation. 

This classic version of the ridge regression problem 
can be generalized in a variety of ways, giving other 
geometric insights. Rather than a constant multi- 
plier k of (3 as the penalty term in equation (20), 
consider a penalty of the form 0^^.(3 with a positive 
definite matrix K. The choice K = diag(/ci, /c2i • • •) 
gives rise to a version of Figure 25 in which the con- 
straint contours are ellipses aligned with the coordi- 
nate axes, with axis lengths inversely proportional 
to ki. These constants allow for differential shrinkage 
of the OLS coefficients. The visual solution to the 
obvious modification of equation (22) is again the 
point where the elliptical contours of RSS(O) kiss the 
contours of the (now elliptical) constraint region. 

6.3.1 Bivariate ridge trace plots Ridge regression 
is touted (optimistically we think) as a method to 
counter the effects of collinearity by trading off a 
small amount of bias for an advantageous decrease in 
variance. The results are often visualized in a ridge 
trace plot [Hoerl and Kennard (1970b)], showing the 
changes in individual coefficient estimates as a func- 



tion of A;. A bivariate version of this plot, with confi- 
dence ellipses for the parameters, is introduced here. 
This plot provides greater insight into the effects of 
k on coefficient variance. 

Confidence ellipsoids for the OLS estimator are 
generated from the estimated covariance matrix of 
the coefficients, 

Va^(/30LS)=a2(xTx)-\ 

For the ridge estimator, this becomes [Marquardt 
(1970)] 

Y^r{f3^^) = (72[X'^X + A;I]"\X"^X) 

(25) 

•[X"^X + A;I]-\ 

which coincides with the OLS result when k = 0. 

Figure 26 uses the classic Longley (1967) data to 
illustrate bivariate ridge trace plots. The data con- 
sist of an economic time series (n = 16) observed 
yearly from 1947 to 1962, with the number of people 
Employed as the response and the following predic- 
tors: GNP, Unemployed, Armed. Forces, Population, 
Year, and GNRdefiator (using 1954 as 100).^*^ For 
each value of k, the plot shows the estimate /3, to- 
gether with the variance ellipse. For the sake of this 
example, we assume that GNP is a primary predic- 
tor of Employment, and we wish to know how other 
predictors modify the regression estimates and their 
variance when ridge regression is used. 

For these data, it can be seen that even small val- 
ues of k have substantial impact on the estimates (3. 



^^Bias and mean-squared error are a different matter: Al- 
thougli Hoerl and Kennard (1970a) demonstrate tliat there is 
a range of values for the ridge constant k for which the MSE 
of the ridge estimator is smaller than that of the OLS estima- 
tor, to know where this range is located requires knowledge of 
(3. As we explain in the following subsection, the constraint 
on incorporated in the ridge estimator can be construed 
as a Bayesian prior; the fly in the ointment of ridge regres- 
sion, however, is that there is no reason to suppose that the 
ridge-regression prior is in general reasonable. 

^^Longley (1967) used these data to demonstrate the ef- 
fects of numerical instability and round-off error in least 
squares computations based on direct computation of the 
cross-products matrix, X^X. Longley's paper sparked the de- 
velopment of a wide variety of numerically stable least squares 
algorithms (QR, modified Gram-Schmidt, etc.) now used is 
almost all statistical software. Even ignoring numerical prob- 
lems (not to mention problems due to lack of independence), 
these data would be anticipated to exhibit high collinearity 
because a number of the predictors would be expected to have 
strong associations with year and/or population, yet both of 
these are also included among the predictors. 
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Fig. 26. Bivariate ridge trace plots for the coefficients of Unemployed and Population against the coefficient for GNP in 
Longley's data, with = 0,0.005,0.01,0.02, 0.04, 0.08. In both cases the coefficients are driven on average toward zero, hut 
the bivariate plot also makes clear the reduction in variance. To reduce overlap, all variance ellipses are shown with 1/2 the 
standard radius. 



What is perhaps more dramatic (and unseen in uni- 
variate trace plots) is the impact on the size of the 
variance elhpse. Moreover, shrinkage in variance is 
generally in a similar direction to the shrinkage in 
the coefficients. This new graphical method is de- 
veloped more fully in Friendly (2013), including 2D 
and 3D plots, as well as more informative represen- 
tations of shrinkage by ellipsoids in the transformed 
space of the SVD of the predictors. 

6.4 Bayesian Linear Models 

In a Bayesian alternative to standard least squares 
estimation, consider the case where our prior infor- 
mation about (3 can be encapsulated in a distribu- 
tion with a prior mean ^p''^"'' and covariance matrix 
A. We show that under reasonable conditions the 
Bayesian posterior estimate, ^P^^t'^^'i™ ^ turns out to 
be a weighted average of the prior coefficients fy^^°^ 
and the OLS solution (3^^^, with weights propor- 
tional to the conditional prior precision, A~^, and 
the data precision given by X"'"X. Once again, this 
can be understood geometrically as the locus of os- 
culation of ellipsoids that characterize the prior and 
the data. 

Under Gaussian assumptions, the conditional like- 
lihood can be written as 



>C(y|X,/3,a2) 

oc (a2)-"/2 



exp 



-^(y-X/3)T(y-X/3) 



To focus on alternative estimators, we can complete 
the square around /3 = (3^^^ to give 



(y-X/3)T(y. 
(26) 



X/3) = (y-X3)^(y-X3) 

+ (/3-3)"'(XTX)(/3-3). 

With a little manipulation, a conjugate prior, of 
the form Pr(/3, cr^) = Pr(/3|cr^) x Pr(c7^), can be ex- 
pressed with Pr(a"^) an inverse gamma distribution 
depending on the first term on the right-hand side of 
equation (26) and Pr(/3|cj^) a normal distribution, 

Pr(/3|a2) 
(27) oc(a2)-^ 

1 



• exp 



2ct2 



(/3 - (3^''°'YA{f3 - 13^'''°') 



The posterior distribution is then Pr(/3, a^ly, X) oc 
Pr(y|X, /3, cr^) x Fv{(3\a'^) x FT{a'^), whence, after 
some simplification, the posterior mean can be ex- 
pressed as 

(28) 3?°^*'="°'- = (X^X + A)-^(X"^X3°^^ + A/SP"""") 

with covariance matrix (X'''X -|- A)~^. The poste- 
rior coefficients are therefore a weighted average of 
the prior coefficients and the OLS estimates, with 
weights given by the conditional prior precision, A, 
and the data precision, X"'"X. Thus, as we increase 
the strength of our prior precision (decreasing prior 
variance), we place greater weight on our prior be- 
liefs relative to the data. 
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In this context, ridge regression can be seen as 
the special case where /JP""'' = and A = kl, and 
where Figure 25 provides an elhptical visuahzation. 
In equation (24), the number of observations, n{k) 
corresponding to X^, can be seen as another way of 
expressing the weight of the prior in relation to the 
data. 

6.5 Mixed Models: BLUEs and BLUPs 

In this section we make implicit use of the du- 
ality between data space and /3 space, where lines 
in one map into points in the other and ellipsoids 
help to visualize the precision of estimates in the 
context of the linear mixed model for hierarchical 
data. We also show visually how the best linear un- 
biased predictors (BLUPs) from the mixed model 
can be seen as a weighted average of the best linear 
unbiased estimates (BLUEs) derived from OLS re- 
gressions performed within clusters of related data 
and the overall mixed model GLS estimate. 

The mixed model for hierarchical data provides 
a general framework for dealing with dependence 
among observations in linear models, such as occurs 
when students are sampled within schools, schools 
within counties and so forth [e.g., Raudenbush and 
Bryk (2002)]. In these situations, the assumption 
of OLS that the errors are conditionally indepen- 
dent is probably violated, because, for example, stu- 
dents nested within the same school are likely to 
have more similar outcomes than those from differ- 
ent schools. Essentially the same model, with pro- 
vision for serially correlated errors, can be applied 
to longitudinal data [e.g., Laird and Ware (1982)], 
although we will not pursue this application here. 

The mixed model for the rii x 1 response vector 
Yi in cluster i can be given as 

yi = X,f3 + ZiUi + ei, 

(29) u,~AA,(0,Gi), 

ei~A/'„,(0,R,), 

where /3 is a p x 1 vector of parameters correspond- 
ing to the fixed effects in the Ui x p model matrix 
Xj; Uj is a g X 1 vector of coefficients corresponding 
to the random effects in the riiX q model matrix Zj ; 
Gj is the q X q covariance matrix of the random ef- 
fects in Uj; and Rj is the rii x rii covariance matrix 
of the errors in Sj. 

Stacking the yj, Xj, Zj and so forth in the obvious 
way then gives 

(30) y = X/3 + Zu + e, 



where u and e are assumed to have normal distri- 
butions with mean and 



(31) 



Var 





G 


0" 







R 



where G = diag(Gi , . . . , G^) , R = diag(Ri , . . . , R^) 
and m is the number of clusters. The variance of y 
is therefore V = ZGZ""" -|- R, and when Z = and 
R = cr^I, the mixed model in equation (30) reduces 
to the standard linear model. 

We now consider the case in which Zj = Xj and we 
wish to predict /3j = /3 -|- Uj , the vector of parameters 
for the ith cluster. At one extreme, we could simply 
ignore clusters and use the common mixed-model 
generalized- least-square estimate, 



(32) 



^GLS ^ (xTv-ix)"^x'^v-V, 



whose sampling variance is Var(/3^'^^) = (X""" V -""X) 
It is an unbiased predictor of /3j since E{(3^^^ — 
f3j) = 0. With moderately large m, the sampling 
variance may be small relative to Gj and Var(/3'^^^ — 
A)«G,. 

At the other extreme, we ignore the fact that clus- 
ters come from a common population and we calcu- 
late the separate BLUE estimate within each clus- 
ter. 



(33) 



/3. 



blue 



(X^Xi) 



Xj yj 



with Var(3""1/3J = S, = ^^(XTXi)' 



Both extremes have drawbacks: whereas the pooled 
overall GLS estimate ignores variation between clus- 
ters, the unpooled within-cluster BLUE ignores the 
common population and makes clusters appear to 
differ more than they actually do. 

This dilemma led to the development of BLUPs 
(best linear unbiased predictor) in models with ran- 
dom effects [Henderson (1975), Robinson (1991), 
Speed (1991)]. In the case considered here, the BLUPs 
are an inverse- variance weighted average of the mixed- 
model GLS estimates and of the BLUEs. The BLUP 
is then 



(34) 



/3, 



blup 



(sr' + G, 



(s-i3r"'' + Gr'3?^')- 



This "partial pooling" optimally combines the infor- 
mation from cluster i with the information from all 
clusters, shrinking ^5]^'"° toward /J*^^^. Shrinkage for 
a given parameter /3jj is greater when the sample 
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BLUEs vs. BLUPS: Intercepts BLUEs vs. BLUPS: Slopes 




Intercept: Avg Math at CSES=0 (BLUE) CSES slope (BLUE) 

Fig. 27. Comparing BLUEs and BLUPs. Each panel plots the OLS estimates from separate regressions for each school 
(BLUEs) versus the mixed model estimates from the random intercepts and slopes model (BLUPs). Left: intercepts; Right: 
slopes for CSES. The shrinkage of the BL UPs toward the CLS estimate is much greater for slopes than intercepts. 



size rii is small or when the variance of the corre- 
sponding random effect, gijj, is small. 

Equation (34) is of the same form as equation (28) 
and other convex combinations of estimates consid- 
ered earlier in this section. So once again, we can 
understand these results geometrically as the locus 
of osculation of ellipsoids. Ellipsoids kiss for a rea- 
son: to provide an optimal convex combination of 
information from two sources, taking precision into 
account. 

6.5.1 Example: Math achievement and SES To il- 
lustrate, we use a classic data set from Bryk and 
Raudenbush (1992) and Raudenbush and Bryk (2002) 
dealing with math achievement scores for a sub- 
sample of 7185 students from 160 schools in the 
1982 High School & Beyond survey of U.S. pub- 
lic and Catholic high schools conducted by the Na- 
tional Center for Education Statistics (NCES). The 
data set contains 90 public schools and 70 Catholic 
schools, with sample sizes ranging from 14 to 67. 

The response is a standardized measure of math 
achievement, while student-level predictor variables 
include sex and student socioeconomic status (SES), 
and school-level predictors include sector (public or 
Catholic) and mean SES for the school (among other 
variables). Following Raudenbush and Bryk (2002), 
student SES is considered the main predictor and is 
typically analyzed centered within schools, CSES^j = 



SESjj — (mean SES)^, for ease of interpretation (mak- 
ing the within-school intercept for school i equal to 
the mean SES in that school). 

For simplicity, we consider the case of CSES as 
a single quantitative predictor in X in the example 
below. We fit and compare the following models: 

(35) y,~AA(/3o + Xi^i,a2) pooled OLS, 

(36) Yi ~ M{/3oi + Xi/3ii,af) unpooled BLUEs, 

Yi ~ M{f3o + XiPi + uoi + XiUii,af) 

(37) 

BLUPs: random intercepts and slopes, 

and also include a fixed effect of sector, common 
to all models; for compactness, the sector effect is 
elided in the notation above. 

In expositions of mixed-effects models, such mod- 
els are often compared visually by plotting predicted 
values in data space, where each school appears as a 
fitted line under one of the models above (sometimes 
called "spaghetti plots"). Our geometric approach 
leads us to consider the equivalent but simpler plots 
in the dual /3 space, where each school appears as a 
point. 

Figure 27 plots the unpooled BLUE estimates 
against the BLUPs from the random effects model, 
with separate panels for intercepts and slopes to 
illustrate the shrinkage of different parameters. In 
these data, the variance in intercepts (average math 
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Fig. 28. Comparing BLUEs and BLUPs. The plot shows 
ellipses of 50% coverage for the estimates of intercepts and 
slopes from OLS regressions (BLUEs) and the mixed model 
(BLUPs), separately for each sector. The centers of the el- 
lipses illustrate how the BLUPS can be considered a weighted 
average of the BLUEs and the mixed-model GLS estimate, 
ignoring sector. The relative sizes of the ellipses reflect the 
smaller variance for the BL UPs compared to the BL UEs, par- 
ticularly for slope estimates. 

achievement for students at CSES = 0), qqq, among 
schools in each sector is large, so the mixed-effects 
estimates have small weight and there is little shrink- 
age. On the other hand, the variance component for 
slopes, (7ii, is relatively small, so there is greater 
shrinkage toward the GLS estimate. 

For the present purposes, a more useful visual 
representation of these model comparisons can be 
shown together in the space of (/3o,/3i), as in Fig- 
ure 28. Estimates for individual schools are not shown, 
but rather these are summarized by the ellipses of 
50% coverage for the BLUEs and BLUPs within 
each sector. The centers of the ellipsoids indicate 
the relatively greater shrinkage of slopes compared 
to intercepts. The sizes of ellipsoids show directly 
the greater precision of the BLUPs, particularly for 
slopes. 

6.6 Multivariate Meta-Analysis 

A related situation arises in random effects mul- 
tivariate meta-analysis [Berkey et al. (1998), Nam, 
Mengersen and Garthwaite (2003)], where several 
outcome measures are observed in a series of sim- 
ilar research studies and it is desired to synthesize 
those studies to provide an overall (pooled) sum- 
mary of the outcomes, together with meta-analytic 



inferences and measures of heterogeneity across stud- 
ies. 

The application of mixed model ideas in this con- 
text differs from the standard situation in that indi- 
vidual data are usually unavailable and use is made 
instead of summary data (estimated treatment ef- 
fects and their covariances) from the published lit- 
erature. The multivariate extension of standard uni- 
variate methods of meta-analysis allows the correla- 
tions among outcome effects to be taken into ac- 
count and estimated, and regression versions can 
incorporate study-specific covariates to account for 
some inter-study heterogeneity. More importantly, 
we illustrate a graphical method based (of course) on 
ellipsoids that serves to illustrate bias, heterogeneity 
and shrinkage in BLUPs, and the optimism of fixed- 
effect estimates when study heterogeneity is ignored. 

The general mixed-effects multivariate meta-anal- 
ysis model can be written as 

(38) yi = Xi/3 + <5i + ei, 

where is a vector of p outcomes (means or treat- 
ment effects) for study i; Xj is the matrix of study- 
level predictors for study i or a unit vector when no 
covariates are available; /3 is the population-averaged 
vector of regression parameters or effects (intercepts, 
means) when there are no covariates; 5i is the p- 
vector of random effects associated with study z, 
whose p X p covariance matrix A represents the be- 
tween-study heterogeneity unaccounted for by Xj/3; 
and, finally, is the p- vector of random sampling 
errors (independent of 8i) within study i, having the 
p X p covariance matrix Si . 

With suitable distributional assumptions, the 
mixed-effects model in equation (38) implies that 

(39) y,~AAp(Xi/3,A + S,) 

with Var(yj) = A -h Sj. When all the Si = 0, and 
thus A = 0, equation (38) reduces to a fixed-effects 
model, Yi = ^i(3 -\- ej, which can be estimated by 
GLS to give 

(40) 3^^^ = (x'^sx)"^x'^s-V, 

(41) Y^v0^^^) = {X.'^SXr\ 

where y and X are the stacked yj and Xj, and S 
is the block-diagonal matrix containing the Sj. The 
fixed-effects model ignores unmodeled heterogene- 
ity among the studies, however, and consequently 
the estimated effects in equation (40) may be bi- 
ased and the estimated uncertainty of these effects 
in equation (41) may be too small. 
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Fig. 29. Multivariate meta-analysis visualizations for five periodontal treatment studies with outcome measures PD and 
AL. Left: Individual study estimates yi and 40% standard ellipses for Si (dashed, red) together with the pooled, fixed effects 
estimate and its associated covariance ellipse (blue). Right: BLUPs from the random- effects multivariate meta-analysis model 
and their associated covariance ellipses (red, solid), together with the pooled, population-averaged estimate and its covariance 
ellipse (blue), and the estimate of the between-study covariance matrix, A (green). Arrows show the differences between the 
FE and the RE models. 



The example we use here concerns the compari- 
son of surgical (S) and nonsurgical (NS) procedures 
for the treatment of moderate periodontal disease 
in five randomized split-mouth design clinical tri- 
als [Antczak-Bouckoms et al. (1993), Berkey et al. 
(1998)]. The two outcome measures for each patient 
were pre- to post-treatment changes after one year 
in probing depth (PD) and attachment level (AL), 
in mm, where successful treatment should decrease 
probing depth and increase attachment level. Each 
study was summarized by the mean difference, = 
(yf ~ yi^'^)) between S and NS treated teeth, to- 
gether with the covariance matrix Sj for each study. 
Sample sizes ranged from 14 to 89 across studies. 

The left panel of Figure 29 shows the individ- 
ual study estimates of PD and AL together with 
their covariances ellipses in a generic form that we 
propose as a more useful visualization of multivari- 
ate meta-analysis results than standard tabular dis- 
plays: individual estimates plus model-based sum- 
mary, all with associated covariance ellipsoids. 

It can be seen that all studies show that surgi- 
cal treatment yields better probing depth (estimates 



The analyses described here were carried out using the 
mvmeta package for R [Gasparrini (2012)]. 



are positive), while nonsurgical treatment results 
in better attachment level (all estimates are neg- 
ative). As well, within each study, there is a con- 
sistently positive correlation between the two out- 
come effects: patients with a greater surgical vs. 
nonsurgical difference on one measure tend to have 
a greater such difference on the other, and greater 
within-study variation on PD than on AL.^*^ The 
overall sizes of the ellipses largely reflect (inversely) 
the sample sizes in the various studies. As far as 
we know, these results were not noted in previous 
analyses of these data. Finally, the fixed-effect es- 
timate, 3^^^ = (0.307,-0.394), and its covariance 
ellipse suggest that these effects are precisely esti- 
mated. 

The random-effects model is more complex be- 
cause (3 and A must be estimated jointly. A vari- 
ety of methods have been proposed (full maximum 
likelihood, restricted maximum likelihood, method 
of moments, Bayesian methods, etc.), whose details 
[for which see Jackson, Riley and White (2011)] are 
not relevant to the present discussion. Given an esti- 



^*Both PD and AL are measured on the same scale (mm), 
and the plots have been scaled to have unit aspect ratio, jus- 
tifying this comparison. 
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mate A, however, the pooled, population-averaged 
point estimate of effects under the random-effects 
model can be expressed as 

(42) 3^^^ = Xj-Er^X.) ' XjSr V.) , 

where = Sj -|- A. The first term in equation (42) 
gives the estimated covariance matrix V = Var(/3^^) 
of the random-effect pooled estimates. For the present 
example, this is shown in the right panel of Fig- 
ure 29 as the blue ellipse. The green ellipse shows 
the estimate of the between-study covariance. A, 
whose shape indicates that studies with a larger es- 
timate of PD also tend to have a larger estimate 
of AL (with correlation = 0.61). It is readily seen 
that, relative to the^fixed-effects estimate (3^^^, the 
unbiased estimate /3^^ = (0.353, —0.339) under the 
random-effects model has been shifted toward the 
centroid of the individual study estimates and that 
its covariance ellipse is now considerably larger, re- 
flecting between-study heterogeneity. In contrast to 
the fixed-effect estimates, inferences on Hq : /3^^ = 
pertain to the entire population of potential studies 
of these effects. 

Figure 29 (right) also shows the best linear un- 
biased predictions of individual study estimates and 
their associated covariance ellipses, superposed (pure- 
ly for didactic purposes) on the fixed-effects esti- 
mates to allow direct comparison. For random-effects 
models, the BLUPs have the form 

(43) 3f'''' = 3^'' + ASri(y.-3^''), 
with covariance matrices 

(44) Va^(3f'^"'')=V + (A- AS-^A). 

Algebraically, the BLUP outcome estimates in 
equation (43) are thus a weighted average of the po- 
pulation-averaged estimates and the study-specific 
estimates, with weights depending on the relative 
sizes of the within- and between-study covariance ma- 
trices Si and A. The point BLUPs borrow strength 
from the assumption of an underlying multivariate 
distribution of study parameters with covariance ma- 
trix A, shrinking toward the mean inversely pro- 
portional to the within-study covariance. Geometri- 
cally, these estimates may be described as occurring 
along the locus of osculation of the ellipses S{yi, Si) 
and f (3, A). 

Finally, the right panel of Figure 29 also shows the 
covariance ellipses of the BLUPs from equation (43). 
It is clear that their orientation is a blending of the 
correlations in V and Sj, and their size reflects the 



error in the average point estimates V and the error 
in the random deviation predicted for each study. 

7. DISCUSSION AND CONCLUSIONS 

I know of scarcely anything so apt to im- 
press the imagination as the wonderful form 
of cosmic order expressed by the "[Ellipti- 
cal] Law of Frequency of Error." The law 
would have been personified by the Greeks 
and deified, if they had known of it. . . . It 
is the supreme law of Unreason. When- 
ever a large sample of chaotic elements are 
taken in hand and marshaled in the order 
of their magnitude, an unsuspected and 
most beautiful form of regularity proves 
to have been latent all along. 

Sir Francis Galton, Natural Inheritance, 
London: Macmihan, 1889 ("[Elliptical]" 

added). 

We have taken the liberty to add the word "Ellip- 
tical" to this famous quotation from Galton (1889). 
His "supreme law of Unreason" referred to univari- 
ate distributions of observations tending to the Nor- 
mal distribution in large samples. We believe he 
would not take us remiss, and might perhaps wel- 
come us for extending this view to two and more 
dimensions, where ellipsoids often provide an "un- 
suspected and most beautiful form of regularity." 

In statistical data, theory and graphical meth- 
ods, one fundamental organizing distinction can be 
made depending on the dimensionality of the prob- 
lem. A coarse but useful scale considers the essential 
defining distinctions to be among: 

• ONE (univariate), 

• TWO (bivariate), 

• MANY (multivariate). 

This scale^^ at least implicitly organizes much of 
current statistical teaching, practice and software. 
But within this classification, the data, theory and 
graphical methods are often treated separately (ID, 
2D, nD), without regard to geometric ideas and vi- 
sualizations that help to unify them. 

This paper starts from the premise that one geo- 
metric form — the ellipsoid — provides a unifying 
framework for many statistical phenomena, with sim- 
ple representations in ID (a line) and 2D (an ellipse) 



This idea, as a unifying classification principle for data 
analysis and graphics, was first suggested to the first author 
in seminars by John Hartigan at Princeton, c. 1968. 
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that extend naturally to higher dimensions (an el- 
lipsoid). The intellectual leap in statistical think- 
ing from ONE to TWO in Galton (1886) was enor- 
mous. Galton's visual insights derived from the el- 
lipse quickly led to an understanding of the ellipse as 
a contour of a bivariate normal surface. From here, 
the step from TWO to MANY would take another 
20-30 years, but it is hard to escape the conclusion 
that geometric insight from the ellipse to the gen- 
eral ellipsoid in nD played an important role in the 
development of multivariate statistical methods. 

In this paper, we have tried to show how ellip- 
soids can be useful tools for visual statistical think- 
ing, data analysis and pedagogy in a variety of con- 
texts often treated separately and from a univari- 
ate perspective. Even in bivariate and multivariate 
problems, first-moment summaries (a ID regression 
line or 2 -|- D regression surface) show only part of 
the story — that of the expectation of a response y 
given predictors X. In many cases, the more inter- 
esting part of the story concerns the precision of 
various methods of estimation, which we've shown 
to be easily revealed through data ellipsoids and el- 
liptical confidence regions for parameters. 

The general relationships among statistical meth- 
ods, matrix algebra and geometry are not new here. 
To our knowledge, Dempster (1969) was the first to 
exploit these relationships in a systematic fashion, 
establishing the connections among abstract vector 
spaces, algebraic coordinate systems, matrix oper- 
ations and properties, the dualities between obser- 
vation space and variable space, and the geometry 
of ellipses and projections. The roots of these con- 
nections go back much further — to Cramer (1946) 
(the idea of the concentration ellipsoid), to Pearson 
(1901) and Hotelling (1933) (principal components), 
and, we maintain, ultimately to Galton (1886). 
Throughout this development, elliptical geometry 
has played a fundamental role, leading to important 
visual insights. 

The separate and joint roles of statistical compu- 
tation and computational graphics should not be un- 
derestimated in appreciation of these developments. 
Dempster's analysis of the connections among geom- 
etry, algebra and statistical methods was fueled by 
the development and software implementation of al- 
gorithms [Gram-Schmidt orthogonalization, Choles- 
ky decomposition, sweep and multistandardize op- 
erators from Beaton (1964)] that allowed him to 
show precisely the translation of theoretical rela- 
tions from abstract algebra to numbers and from 
there to graphs and diagrams. 



Monette (1990) took these ideas several steps fur- 
ther, developing interactive 3D graphics focused on 
linear models, geometry and ellipsoids, and demon- 
strating how many statistical properties and results 
could be understood through the geometry of ellip- 
soids. Yet, even at this later date, the graphical fa- 
cilities of readily available statistical software were 
still rather primitive, and 3D graphics was available 
only on high-end workstations. 

Several features of the current discussion may help 
to present these ideas in a new light. First, the exam- 
ples we have presented rely heavily on software for 
statistical graphics developed separately and jointly 
by all three authors. These have allowed us to create 
what we hope are compelling illustrations, all sta- 
tistically and geometrically exact, of the principles 
and ideas that form the body of the paper. More- 
over, these are now general methods, implemented in 
a variety of R packages, for example. Fox and Weis- 
berg (2011), Friendly (2007b), and a large collec- 
tion of SAS macros (http://datavis.ca/sasmac), 
so we hope this paper will contribute to turning the 
theory we describe into practice. 

Second, we have illustrated, in a wide variety of 
contexts, comprising all classical (Gaussian) linear 
models, multivariate linear models and several ex- 
tensions, how ellipsoids can contribute substantially 
to the understanding of statistical relationships, both 
in data analysis and in pedagogy. One graphical 
theme underlying a number of our examples is how 
the simple addition of ellipses to standard 2D graph- 
ical displays provides an efficient visual summary 
of important bivariate statistical quantities (means, 
variances, correlation, regression slopes, etc.). While 
first-moment visual summaries are now common ad- 
juncts to graphical displays in standard software, of- 
ten by default, we believe that the second-moment 
visual summaries of ellipses (and ellipsoids in 3D) 
now deserve a similar place in statistical practice. 

Finally, we have illustrated several recent or en- 
tirely new visualizations of statistical methods and 
results, all based on elliptical geometry. HE plots 
for MANOVA designs [Friendly (2007b)] and their 
projections into canonical space (Section 5) provide 
one class of examples where ellipsoids provide sim- 
ple visual summaries of otherwise complex statisti- 
cal results. Our analysis of the geometry of added 
variable-plots suggested the idea of superposing mar- 
ginal and conditional plots, as in Figure 18, lead- 
ing to direct visualization of the difference between 
marginal and conditional relationships in linear mod- 
els. The bivariate ridge trace plots described in Sec- 
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tion 6.3.1 are a direct outgrowth of the geometric ap- 
proach taken here, emphasizing the duaUty between 
views in data space and in parameter (/3) space. We 
beheve these all embody von Humboldt's (1811) dic- 
tum, quoted in the Introduction. 

APPENDIX: GEOMETRICAL AND 
STATISTICAL ELLIPSOIDS 

This appendix outhnes useful results and proper- 
ties concerning the representation of geometric and 
statistical ellipsoids. A number of these can be traced 
to or have more general descriptions within the ab- 
stract formulation of Dempster (1969), but casting 
them in terms of ellipsoids provides a simpler and 
more easily visualized framework. 

A.l Taxonomy and Representation of 
Generalized Ellipsoids 

Section 2.1 defined a proper (origin-centered) el- 
lipsoid in by i5 := {x : x"'"Cx < 1} that is bounded 
with nonempty interior (call these "fat" ellipsoids). 
For more general purposes, particularly for statis- 
tical applications, it is useful to give ellipsoids a 
wider definition. To provide a complete taxonomy, 
this wider definition should also include ellipsoids 
that may be unbounded in some directions in W 
(an infinite cylinder of ellipsoidal cross-section) and 
degenerate (singular) ellipsoids that are "flat" in W 
with empty interior, such as when a 3D ellipsoid 
has no extent in one dimension (collapsing to an 
ellipse), or in two dimensions (collapsing to a line). 
These ideas are made precise below with a definition 
of the signature, ^(C), of a generalized ellipsoid. 

The motivation for this more general representa- 
tion is to allow a notation for a class of generalized 
ellipsoids to be algebraically closed under operations 
(a) image and preimage under a linear transforma- 
tion, and (b) inversion. The goal is to be able to 
think about, visualize and compute a linear trans- 
formation of an ellipsoid with central matrix C or its 
inverse transformation via an analog of C~^, which 
applies equally to unbounded and/or degenerate el- 
lipsoids. Algebraically, the vector space of C is the 
dual of that of [Dempster (1969), Chapter 6] 
and vice-versa. Geometrical applications can show 
how points, lines and hyper planes in are all spe- 
cial cases of ellipsoids. Statistical applications con- 
cern the relationship between a predictor data ellip- 
soid and the corresponding f3 confidence ellipsoid 
(Section 4.6): The (3 ellipsoid will be unbounded 



(some linear combinations will have infinite confi- 
dence intervals) iff the corresponding data ellip- 
soid is flat, as when p > n or some predictors are 
collinear. 

Defining ellipsoids with {x:x"'^Cx< 1} produces 
proper ellipsoids for C positive definite and unbound- 
ed, fat ellipsoids for C positive semi-definite. But it 
does not produce degenerate (i.e., fiat) ellipsoids. On 
the other hand, the representation in equation (3), 
£ := AS, with S the unit sphere, produces proper 
ellipsoids when C = (A"'"A)~^ where A is a nonsin- 
gular p X p matrix and degenerate ellipsoids when 
A is a singular, but does not produce unbounded 
ellipsoids. 

One representation that works for all ellipsoids — 
fat or fiat and bounded or unbounded — can be based 
on a singular value decomposition (SVD) represen- 
tation A = UAV"^, with 

(A.l) £:=\J{AS), 

where U is orthogonal and A is diagonal with non- 
negative reals or infinity.^'' The "inverse" of an el- 
lipsoid £ is then simply U(A~^5). The connection 
with traditional representations is that, if A is fi- 
nite, A = UAV"*", where V can be any orthogonal 
matrix, and if A~^ is finite, C = UA^^U"*". 

The U(A5) representation also allows us to char- 
acterize any such generalized ellipsoid in W by its 
signature, 

g{C) = #[6^> 0,6^ = 0,6^ = 00] 

(A.2) 

with EOiC)=p, 

a 3-vector containing the number of positive, 
zero and infinite singular values. For example, in 
M^, any proper ellipsoid has the signature 0(0) = 
(3,0,0); a flat, 2D ellipsoid has ^(C) = (2,1,0); a 
flat, ID ellipsoid (a line) has ^(C) = (1,2,0). Un- 
bounded examples include an infinite flat plane, with 
G{C) = (0, 1, 2), and an infinite cylinder of elliptical 
cross-section, with ^(C) = (2,0, 1). 

Figure A.l illustrates these ideas, using two gen- 
erating matrices, Ci and C2, in this more general 
representation. 





"6 
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1" 




"6 
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, C2 = 
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3 







1 


2 


2 














Note that the parentheses in this notation are obliga- 
tory: A as defined transforms the unit sphere, which is then 
transformed by U. V^, also orthogonal, plays no role in this 
representation, because an orthogonal transformation of 5 is 
still a unit sphere. 
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Fig. A.l. Two views of an example of generalized ellipsoids. Ci (blue) determines a proper, fat ellipsoid; its inverse C]~^ 
also generates a proper ellipsoid. C2 (red) determines an improper, flat ellipsoid, whose inverse C^^ is an unbounded cylinder 
of elliptical cross section. The scale of these images is defined by a unit sphere (gray). The left panel shows that C2 is a 
projection of Ci onto the plane where z = 0. The right panel shows a view illustrating the orthogonality of each C and its 
dual, C~^. 



where Ci generates a proper ellipsoid and C2 gen- 
erates an improper, flat ellipsoid. Ci and its dual, 
, both have signatures (3,0,0). C2 has the sig- 
nature (2,1,0), while its inverse (dual) has the sig- 
nature (2, 0, 1). These varieties of ellipsoids are more 
easily understood in the 3D movies included in the 
online supplements. 

A. 2 Properties of Geometric Ellipsoids 

• Translation: An ellipsoid centered at xq has the 
definition <5 := {x: (x — xo)"'"C(x — xq) = 1} or 
£ ■.= xq(B as in the notation of Section 2.2. 

• Orthogonality: If C is diagonal, then the origin- 
centered ellipsoid has its axes aligned with the 
coordinate axes and has the equation 

(A. 3) Cx = Ciixl + C22xl -\ \-CppXp = l, 

where l/^/cii = c--^^'^ are the radii (semi-diameter 
lengths) along the coordinate axes. 

• Area and volume: In two dimensions, the area of 
the axis-aligned ellipse is vr(ciiC22)~^^^- For p = 
3, the volume is |'/r(ciiC22C33)^^/^. In the gen- 



eral case, the hypervolume of the ellipsoid is pro- 
portional to |C|~^/^ = ||A|| and is given by 
7rf/2det(C)-i/V[r(| + l)] [Dempster (1969), Sec- 
tion 3.5], where the first two factors are familiar 
as the normalizing constant of the multivariate 
normal density function. 
• Principal axes: In general, the eigenvectors, Vj, i = 
1, . . . ,p, of C define the principal axes of the el- 
lipsoid and the inverse of the square roots of the 
ordered eigenvalues, Ai > A2, • • • , Ap, are the prin- 
cipal radii. Eigenvectors belonging to eigenvalues 
that are are directions in which the ellipsoid is 
unbounded. With £ = AS, we consider the singu- 
lar-value decomposition A = UDV^, with U and 
V orthogonal matrices and D a diagonal nonneg- 
ative matrix with the same dimension as A. The 
column vectors of U, called the left singular vec- 
tors, correspond to the eigenvectors of C in the 
case of a proper ellipsoid. The positive diagonal 
elements of D, di > 1^2 > • • • > dp > 0, are the prin- 
cipal radii of the ellipsoid with di = In the 
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Fig. a. 2. Some properties of geometric ellipsoids, shown m 
2D. Principal axes of an ellipsoid are given by the eigenvec- 
tors ofC, with radii l/y/Xi. For an ellipsoid defined by equa- 
tion (1), the comparable ellipsoid for 2C has radii multiplied 
by l/\/2. The ellipsoid for C^^ has the same principal axes, 
but with radii y/Xi, making it small in the directions where C 
is large and vice-versa. 

singular case, the left singular vectors form a set 
of principal axes for the flattened ellipsoid. 

• Inverse: When C is positive definite, the eigenvec- 
tors of C and are identical, while the eigen- 
values of are l/Aj. It follows that the ellipsoid 
for has the same axes as that of C, but with 
inversely proportional radii. In M^, the ellipsoid 
for is, with rescaling, a 90° rotation of the 
ellipsoid for C, as illustrated in Figure A. 2. 

• Generalized inverse: A definition for an inverse 
ellipsoid that is equivalent in the case of proper 
ellipsoids, 



(A.4) 



£ 



-1 



{y:|xTy|<l,VxG£:}, 



generalizes to all ellipsoids. The inverse of a sin- 
gular ellipsoid is an improper ellipsoid and vice 
versa. 

Dimensionality: The ellipsoid is bounded if C is 
positive definite (all Aj > 0). Each Aj = increases 
the dimension of the space along which the el- 
lipsoid is unbounded by one. For example, with 



^^Corresponding left singular vectors and eigenvectors are 
not necessarily equal, but sets that belong to the same eigen- 
value/singular value span the same space. 



p = 3, A3 = gives a cylinder with an elliptical 
cross-section in 3-space, and A2 = A3 = gives an 
infinite slab with thickness 2^/Xl. With £ = AS, 
the dimension of the ellipsoid is equal to the num- 
ber of positive singular values of A. 
Projections: The projection of a p-dimensional el- 
lipsoid into any subspace is P£, where P is an 
idempotent pxp (projection) matrix, that is, PP = 
P^ = P. For example, in and M^, the matrices 



1 1 




1 
1 




project, respectively, an ellipse onto the line xi = 
X2 and an ellipsoid into the {xi,X2) plane. If P is 
symmetric, then P is the matrix of an orthogonal 
projection, and it is easy to visualize P£ as the 
shadow of £ cast perpendicularly onto span(P). 
Generally, P£ is the shadow of £ onto span(P) 
along the null space of P. 

• Linear transformations: A linear transformation 
of an ellipsoid is an ellipsoid, and the pre-image 
of an ellipsoid under a linear transformation is 
an ellipsoid. A nonsingular linear transformation 
maps a proper ellipsoid into a proper ellipsoid in 
the form shown in Section 2.2, equation (7). 

• Slopes and tangents: The slopes of the ellipsoidal 
surface in the directions of the coordinate axes 
are given by 9/(9x(x"'"Cx) = 2Cx. From this, it 
follows that the tangent hyperplane to the unit 
ellipsoidal surface at the point Xq, where xj9/ 
9x(x"'"Cx) = 0, has the equation xJCx = 1. 

A. 3 Conjugate Axes and Inner-Product Spaces 

For any nonsingular A in equation (5) that gener- 
ates an ellipsoid, the columns of A = [ai, a2, . . . , ap] 
form a set of "conjugate axes" of the ellipsoid. (Two 
diameters are conjugate ijf the tangent line at the 
endpoint of one diameter is parallel to the other di- 
ameter.) Each vector aj lies on the ellipsoid, and the 
tangent hyperplane at that point is parallel to the 
span of all the other column vectors of A. For p = 2 
this result is illustrated in Figure A. 3 (left) in which 



(A.5) 



A = [ ai a2 ' 



W = AA' 



1 1.5 

2 1 

3.25 3.5" 
3.5 5 
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Fig. a. 3. Conjugate axes of an ellipse with various factorizations of W and corresponding basis vectors. The conjugate 
vectors lie on the ellipse, and their tangents can be extended to form a parallelogram framing it. Left: for an arbitrary fac- 
torization, given in equation (A. 5). Right: for the Choleski factorization (solid, green, 61,62^ and the principal component 
factorization (dashed, brown, ci,C2). 



W 



-1 



Consider the inner-product space with inner prod- 
uct matrix 

1.25 -0.875" 
-0.875 0.8125 ^ 

and inner product (x,y) = x'W~"^y. 

Because A^W'^A = A"^(AA"^)"1A = A"^(A'^)-i • 
A~^A = I, we see that ai and a2 are orthogonal 
unit vectors (in fact, an orthonormal basis) in this 
inner product: 

(ai,ai) = a"''iW^^ai = 1, 

(ai,a2) =a'iW-ia2 = 0. 

Now, if W = BB"*" is any other factorization of 
W, then the columns of B have the same proper- 
ties as the columns of A. Particular factorizations 
yield interesting and statistically useful sets of con- 
jugate axes. The illustration in Figure A. 3 (right) 
shows two such cases with special properties: In the 
Choleski factorization (shown solid in green), where 
B is lower triangular, the last conjugate axis, b2, is 
aligned with the coordinate axis X2. Each previous 
axis (bi, here) is the orthogonal complement to all 
later axes in the inner-product space of W~^. The 
Choleski factorization is unique in this respect, sub- 
ject to a permutation of the rows and columns of 



W. The subspace {cibi -|- • • • -|- Cp_ibp_i,Cj G M} is 
the plane of the regression of the last variable on the 
others, a fact that generalizes naturally to ellipsoids 
that are not necessarily centered at the origin. 

In the principal-component (PC) factorization 
(shown dashed in brown in Figure A. 3, right) W = 
CC"^, where C = TA^/^ ^^^^ hence, W = TAT' is 
the spectral decomposition of W. Here, the ellipse 
axes are orthogonal in the space of the ellipse (so 
the bounding tangent parallelogram is a rectangle) 
as well as in the inner-product space of W~^. The 
PC factorization is unique in this respect (up to re- 
flections of the axis vectors). 

As illustrated in Figure A. 3, each pair of conjugate 
axes has a corresponding bounding tangent parallel- 
ogram. It can be shown that all such parallelograms 
have the same area and equal sums of squares of the 
lengths of their diameters. 

A. 4 Ellipsoids in a Generalized Metric Space 

In Appendix A. 3, we considered the positive semi- 
definite matrix W and corresponding ellipsoid to be 
referred to a Euclidean space, perhaps with different 
basis vectors. We showed that various measures of 
the "size" of the ellipsoid could be defined in terms 
of functions of the eigenvalues Aj of W. 

We now consider the generalized case of an anal- 
ogous p X p positive semi-definite symmetric matrix 




H, but where measures of length, distance and an- 
gles are referred to a metric defined by a positive- 
definite symmetric matrix E. As is well known, the 
generalized eigenvalue problem is to find the scalars 
Xi and vectors = l,2,...,p, such that Hv = 
AEv, that is, the roots of det(H - AE) = 0. 

For such H and E, we can always find a factor 
A of E, so that E = AA""", whose columns will be 
conjugate directions for E and whose rows will also 
be conjugate directions for H, in that H = A"'"DA, 
where D is diagonal. Geometrically, this means that 
there exists a unique pair of bounding parallelo- 
grams for the H and E ellipsoids whose correspond- 
ing sides are parallel. A linear transformation of E 
and H that transforms the parallelogram for E to 
a square (or cuboid), and hence E to a circle (or 
spheroid), generates an equivalent view in what we 
describe below as canonical space. 

In statistical applications (e.g., MANOVA, canon- 
ical correlation), the generalized eigenvalue problem 
is transformed to an ordinary eigenvalue problem by 
considering the following equivalent forms with the 
same Aj, Vj: 

(H - AE)v = 0, 

(HE^^ - AI)v = 0, 

{E~^/^IIB~^/^ -XI)v = 0, 

where the last form gives a symmetric matrix, H* = 
E-1/2HE-1/2, Using the square root of E defined by 
the principal-component factorization E^/^ = FA^^^ 
produces the ellipsoid H*, the orthogonal axes of 
which correspond to the Vj, whose squared radii are 



the corresponding eigenvalues Aj. This can be seen 
geometrically as a rotation of "data space" to an ori- 
entation defined by the principal axes of E, followed 
by a re-scaling, so that the E ellipsoid becomes the 
unit spheroid. In this transformed space ("canonical 
space" ) , functions of the squared radii Aj of the axes 
of H* give direct measures of the "size" of H rela- 
tive to E. The orientation of the eigenvectors Vj can 
be related to the (orthogonal) linear combinations 
of the data variables that are successively largest in 
the metric of E. 

To illustrate. Figure A. 4 (left) shows the ellipses 
generated by 



H 



and E 



1 

0.5 



0.5 

2 



together with their conjugate axes. For E, the con- 
jugate axes are defined by the columns of the right 
factor, a"*", in E = AA"""; for H, the conjugate axes 
are defined by the columns of A. The transforma- 
tion to H* = E~^/^HE~^/^ is shown in the right 
panel of Figure A. 4. In this "canonical space," an- 
gles and lengths have the ordinary interpretation of 
Euclidean space, so the size of H* can be interpreted 
directly in terms of functions of the radii -v/Ai and 
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SUPPLEMENTARY MATERIAL 

Supplementary materials for Elliptical insights: 
Understanding statistical methods through ellipti- 
cal geometry (DOL 10.1214/12-STS402SUPP; .pdf). 
The supplementary materials include SAS and R 
scripts to generate all of the figures for this arti- 
cle. Several 3D movies are also included to show 
phenomena better than can be rendered in static 
print images. A new R package, gellipsoid, provides 
computational support for the theory described in 
Appendix A.l. These are also available at http:// 
datavis . ca/papers/ellipses and described in 
http : //datavis . ca/papers/ellipses/supp.pdf 
[Friendly, Monette and Fox (2012)]. 
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